Abstract. Choice models, which capture popular preferences over objects
of interest, play a key role in making decisions whose eventual outcome is 
impacted by human choice behavior. In most scenarios, the choice model, 
which can effectively be viewed as a distribution over permutations, must be 
learned from observed data. The observed data, in turn, may frequently be 
viewed as (partial, noisy) information about marginals of this distribution 
over permutations. As such, the search for an appropriate choice model boils 
down to learning a distribution over permutations that is (near-)consistent 
with observed information about this distribution. 

In this work, we pursue a non-parametric approach which seeks to learn 
a choice model (i.e. a distribution over permutations) with sparsest possible 
support, and consistent with observed data. We assume that the data observed 
consists of noisy information pertaining to the marginals of the choice model 
we seek to learn. We establish that any choice model admits a 'very' sparse 
approximation in the sense that there exists a choice model whose support is 
small relative to the dimension of the observed data and whose marginals
approximately agree with the observed marginal information. We further show
that under what we dub 'signature' conditions, such a sparse approximation
can be found in a computationally efficient fashion relative to a brute-force
approach. An empirical study using the American Psychological Association 
election data-set suggests that our approach manages to unearth useful struc- 
tural properties of the underlying choice model using the sparse approximation 
found. Our results further suggest that the signature condition is a potential 
alternative to the recently popularized Restricted Null Space condition for 
efficient recovery of sparse models. 



1. Introduction 

1.1. Background. It is imperative for an architect of a societal system, be it a road 
transportation system, energy distribution network, or the Internet, to deal with the 
uncertainty arising from human participation in general, and human choice behavior 
in particular. One possible approach to this end is to make assumptions on
the behavior of an individual (for instance, assuming that every individual is a 
rational utility maximizer). Such an assumption leads, in turn, to a collective 
behavioral model for the entire population. This model can subsequently be used 
to guide system design, e.g. where to invest resources to build new roads in the 
country or what sorts of products to put up for sale at a store. Such models of
the collective preferences of a population over objects of interest are colloquially
referred to as customer choice models, or simply choice models. As suggested by the
above discussion, choice models form crucial inputs to making effective decisions
across a swathe of domains.

Now in practice, a choice model is revealed through partial information people 
provide about their preferences via their purchase behavior, responses to polls, or 
in explicit choices they make. In assuming a behavioral model for the population 
one runs the risk of mis-modeling choice. Ideally, one wishes to learn a choice model 
consistent with observed partial preferences, having made little or no behavioral or 
structural assumptions. In the absence of such structural assumptions one needs 
a criterion to select a choice model from among the many that will likely agree 
with the observed partial preferences. A natural criterion here is structural 'sim- 
plicity' (a precise definition of which we defer for now). Since choice models are 
used as inputs to decision problems, it makes operational sense to seek a choice 
model that is structurally simple. In addition, a criterion of this sort is consistent 
with Occam's razor. Thus motivated, we consider here the question of efficiently 
learning a 'simple' choice model that is consistent with observed partial (marginal) 
information. 

1.2. Related prior work. There is a large literature devoted to learning struc- 
turally simple choice models from partial observations. Most prior work has focused 
on parametric approaches. Given the nature of the topic, the literature is quite di- 
verse and hence it is not surprising that the same choice model appears under 
different names in different areas. In what follows, we provide a succinct overview 
of the literature. 

1.2.1. Learning Parametric Models. To begin with, the monograph by Diaconis [18,
Chapter 9] provides a detailed history of most of the models and references given
below. In the simplest setting, a choice model (which, recall, is simply a distribution
over the permutations of N objects of interest) is captured by the order statistics
of N random variables $Y_1, \dots, Y_N$. Here $Y_i = u_i + X_i$, where the $u_i$ are parameters
and the $X_i$ are independent, identically distributed random variables. Once the
distributions of the $X_i$ are specified, the choice model is specified.

This class of models was proposed nearly a century ago by Thurstone [48]. A
specialization of the above model in which the $X_i$ are assumed to be normal with
mean 0 and variance 1 is known as the Thurstone-Mosteller model. This is also
known, more colloquially, as the probit model.

Another specialization of the Thurstone model is realized when the $X_i$ are
assumed to have Gumbel or Logit distributions (one of the extreme value
distributions). This model is attributed differently across communities. Holman and Marley
established that this model is equivalent (see [53] for details) to a generative model
where the N objects have positive weights $w_1, \dots, w_N$ associated with them, and a
random permutation of these N objects is generated by recursive selection (without
replacement) of objects in the first position, second position and so on with selection
probabilities proportional to their weights. As per this, the probability of object i
being preferred over object j (i.e. object i is ranked higher compared to object j)
is $w_i/(w_i + w_j)$. The model in this form is known as the Luce model [33] and also as
the Plackett model [43]. Finally, this model is also referred to as the Multinomial



It is worth noting that this model is very similar to the Bradley-Terry model [9], where each
object i has weight $w_i > 0$ associated with it; the Bradley-Terry model is however distinct from the
model proposed by Plackett and Luce in the probabilities it assigns to each of the permutations.
Logit Model (MNL) after McFadden referred to it as a conditional logit model [35];
also see [17]. We will adopt the convention of referring to this important model as
the MNL model.
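The recursive-selection description of the MNL model can be illustrated with a short simulation. The following is a minimal sketch, not part of the original development: the function names, the example weights and the sample size are all assumptions made for illustration. It draws rankings by repeatedly selecting the next object with probability proportional to its weight, and compares the empirical frequency with which object i is ranked above object j against $w_i/(w_i + w_j)$.

```python
import random


def sample_mnl_ranking(weights, rng):
    """Draw one ranking by recursive selection without replacement.

    weights[i] > 0 is the weight of object i (0-indexed). The returned list
    `order` has order[0] as the most preferred object, order[1] the next, etc.
    """
    remaining = list(range(len(weights)))
    order = []
    while remaining:
        total = sum(weights[i] for i in remaining)
        r = rng.random() * total
        acc = 0.0
        chosen = len(remaining) - 1  # fallback guards against float rounding
        for idx, i in enumerate(remaining):
            acc += weights[i]
            if r < acc:
                chosen = idx
                break
        order.append(remaining.pop(chosen))
    return order


def pairwise_frequency(weights, i, j, num_samples, seed=0):
    """Empirical fraction of sampled rankings that place object i above j."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(num_samples):
        order = sample_mnl_ranking(weights, rng)
        if order.index(i) < order.index(j):
            wins += 1
    return wins / num_samples


weights = [1.0, 2.0, 3.0]  # hypothetical weights w_1, w_2, w_3
freq = pairwise_frequency(weights, 0, 1, num_samples=20000)
target = weights[0] / (weights[0] + weights[1])  # w_i / (w_i + w_j)
```

With enough samples, `freq` should be close to `target`; the agreement is statistical, so only approximate equality is expected.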

The MNL model is of central importance for various reasons. It was introduced
by Luce to be consistent with the axiom of independence from irrelevant
alternatives (IIA). The model was shown to be consistent with the induced preferences
assuming a random utility maximization framework whose inquiry was started by
Marschak [3S1 |37]. Very early on, simple statistical tests as well as simple
estimation procedures were developed to fit such a model to observed data [38]. Now the
IIA property possessed by the MNL model is not necessarily desirable, as evidenced
by empirical studies. Despite such structural limitations, the MNL model has been
widely utilized across application areas, primarily due to the ability to learn the
model parameters easily from observed data. For example, see [391 HI !4Uj for
applications in transportation and [26], [34] for applications in operations management
and marketing.

With a view to addressing the structural limitations of the MNL model, a num- 
ber of generalizations to this model have been proposed over the years. Notable 
among these is the so-called 'nested' MNL model, as well as mixtures of MNL mod- 
els (or MMNL models). These generalizations avoid the IIA property and continue 
to be consistent with the random utility maximization framework at the expense 
of increased model complexity; see [3J SI IS1 HH SI]. The interested reader is also
referred to an overview article on this line of research by McFadden [ID]. While
generalized models of this sort are in principle attractive, their complexity makes 
them difficult to learn while avoiding the risk of over-fitting. More generally, spec- 
ifying an appropriate parametric model is a difficult task, and the risks associated 
with mis-specification are costly in practice. For an applied view of these issues 
see [H[271[T7]. Thus, while these models are potentially valuable in specific well 
understood scenarios, the generality of their applicability is questionable. 

As an alternative to the MNL model (and its extensions), one might also con- 
sider the parametric family of choice models induced by the exponential family of 
distributions over permutations. These may be viewed as choice models that have 
maximum entropy among those models that satisfy the constraints imposed by the 
observed data. The number of parameters in such a model is equal to the number 
of constraints in the maximum entropy optimization formulation, or equivalently
the effective dimension of the underlying data (cf. the Koopman-Pitman-Darmois
Theorem [31]). This scaling of the number of parameters with the effective data
dimension makes the exponential family obtained via the maximum entropy prin- 
ciple very attractive. Philosophically, this approach imposes on the choice model, 
only those constraints implied by the observed data. On the flip side, learning the 
parameters of an exponential family model is a computationally challenging task 
(see [TB], [S] and [52]) as it requires computing a "partition function" possibly over 
a complex state space. 

1.2.2. Learning Nonparametric Models. As summarized above, parametric models 
either impose strong restrictions on the structure of the choice model and/or are 
computationally challenging to learn. To overcome these limitations, we consider a 
nonparametric approach to learning a choice model from the observed partial data. 

The given partial data most certainly does not completely identify the underlying 
choice model or distribution over permutations. Specifically, there are potentially 



SPARSE CHOICE MODELS 






multiple choice models that are (near) consistent with the given observations, and 
we need an appropriate model selection criterion. In the parametric approach, one 
uses the imposed parametric structure as the model selection criterion. On the other 
hand, in the nonparametric approach considered in this paper, we use simplicity, 
or more precisely the sparsity or support size of the distribution over permutations, 
as the criterion for selection: specifically, we select the sparsest model (i.e. the 
distribution with the smallest support) from the set of models that are (near) 
consistent with the observations. This nonparametric approach was first proposed 
by Jagabathula and Shah [29, 30] and developed further by Farias, Jagabathula and 
Shah [22 122]. Following [2HJ EES I2H [22], we restrict ourselves to observations that
are in the form of marginal information about the underlying choice model. For 
instance, the observations could be in the form of first-order marginal information, 
which corresponds to information about the fraction of the population that ranks 
object i at position j for all $1 \le i, j \le N$, where N is the number of objects.

A major issue with the identification of sparse models from marginal
information is the associated computational cost. Specifically, recovering a distribution
over permutations of N objects, in principle, requires identifying probabilities of
$N!$ distinct permutations. The distribution needs to be recovered from marginal
information, which can usually be cast as a lower dimensional "linear projection" of
the underlying choice model; for instance, the first-order marginal information can
be thought of as a linear projection of the choice model on the $(N-1)^2$-dimensional
space of doubly stochastic matrices. Thus, finding a sparse model consistent with
the observations is equivalent to solving a severely underdetermined system of
linear equations in $N!$ variables, with the aim of finding a sparse solution. As a result,
at first glance, it appears that the computational complexity of any procedure
should scale with the dimension of the variable space, $N!$.

In |29| I5U] |2U 122], the authors identified a so-called 'signature condition' on
the space of choice models and showed that whenever a choice model satisfies the
signature condition and noiseless marginal data is available, it can be exactly
recovered in an efficient manner from marginal data with computational cost that scales
linearly in the dimension of the marginal data ($(N-1)^2$ for first-order marginals)
and exponentially in the sparsity of the choice model. Indeed, for sparse choice
models this is excellent. They also established that the 'signature condition' is not
merely a theoretical construct. In fact, a randomly chosen choice model with a
"reasonably large" sparsity (support size) satisfies the 'signature condition' with
high probability. The precise sparsity scaling depends on the type of marginal
data available; for instance, for the first-order marginals, the authors show that a
randomly generated choice model with sparsity up to $O(N \log N)$ satisfies the
'signature condition'. In summary, the works of [2H] [3U1 H2 HI] establish that if the
original choice model satisfies the 'signature condition' (e.g. generated randomly
with reasonable sparsity) and the available observations are noise-free, then the
sparsest choice model consistent with the observations can be recovered efficiently.

1.3. Our contributions. In reality, available data is not noise-free. Even more 
importantly, the data might arise from a distribution that is not sparse to begin 
with! The main contribution of the present paper is to address the problem of 
learning non-parametric choice models in the challenging, more realistic case when 
the underlying model is potentially non-sparse and the marginal data is corrupted 
by noise. Specifically, we consider the problem of finding the sparsest model that 
is near consistent with - or equivalently, within a "distance" $\varepsilon$ of - the marginal
data. We consider the setting in which the marginal information can be cast as
a linear projection of the underlying choice model over a lower dimensional space.
We restrict ourselves primarily to first-order marginal information throughout this
paper; a discussion about how our methods and results extend to general types of
marginal information is deferred to the end.

In this context, we consider two main questions: (1) How does one find the 
sparsest consistent distribution in an efficient manner? and (2) How "good" are 
sparse models in practice? Next, we describe the contributions we make towards 
answering these questions. 

In order to understand how to efficiently find sparse models approximating the 
given data, we start with the more fundamental question of "how sparse can the 
sparsest solution be?" To elaborate further, we first note that the space of first- 
order marginal information is equivalent to the space of doubly stochastic matrices 
(this equivalence is explained in detail in subsequent sections). Given this, finding 
the sparsest choice model that is near consistent with the observations is essentially 
equivalent to determining the convex decompositions (in terms of permutations) of 
all doubly stochastic matrices that are within a ball of "small" radius around the 
given observation matrix and choosing the sparsest convex decomposition. Now, it 
follows from the celebrated Birkhoff-von Neumann theorem (see [7] and [5T]) that
a doubly stochastic matrix belongs to an $(N-1)^2$-dimensional polytope with the
permutation matrices as the extreme points. Therefore Caratheodory's theorem
tells us that it is possible to find a convex decomposition of any doubly stochastic
matrix with at most $(N-1)^2 + 1$ extreme points, which in turn implies that the
sparsest model consistent with the observations has a support of at most
$(N-1)^2 + 1 = \Theta(N^2)$. We raise the following natural question at this point: given any
doubly stochastic matrix (i.e. any first-order marginal data), does there exist a
near consistent choice model with sparsity significantly smaller than $\Theta(N^2)$?
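The dimension count behind the $(N-1)^2 + 1$ bound can be checked directly. The short derivation below is a sketch added for illustration; it uses only the facts that an $N \times N$ doubly stochastic matrix has unit row and column sums, and that one of these $2N$ constraints is redundant because the sum of all row sums equals the sum of all column sums.

```latex
\begin{align*}
\#\{\text{entries of the matrix}\} &= N^2, \\
\#\{\text{independent linear constraints}\}
  &= \underbrace{N}_{\text{row sums}} + \underbrace{N}_{\text{column sums}} - 1 = 2N - 1, \\
\text{dimension of the polytope} &= N^2 - (2N - 1) = (N-1)^2 .
\end{align*}
```

Caratheodory's theorem then gives a convex decomposition with at most $(N-1)^2 + 1$ extreme points. For example, for $N = 3$ the polytope is 4-dimensional, so at most 5 of the $3! = 6$ permutation matrices are needed.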

Somewhat surprisingly, we establish that in as much as first-order marginals
are concerned, any choice model can be $\varepsilon$-approximated by a choice model with
sparsity or support $O(N/\varepsilon^2)$. More specifically, we show that any non-negative
valued doubly stochastic matrix can be $\varepsilon$-approximated in the $\ell_2$ sense by a convex
combination of $O(N/\varepsilon^2)$ permutation matrices. This is significantly smaller than
$\Theta(N^2)$.

The next question pertains to finding such a sparse model efficiently. As
mentioned above, the signature conditions have played an important role in efficient
learning of sparse choice models from noise-free observations. It is natural to ask
whether they can be useful in the noisy setting. More precisely, can the first-order
marginals observed be well approximated by those of a choice model from the
signature family, and if so, can this be leveraged to efficiently recover a sparse choice
model consistent with observations?

To answer the first question, we identify conditions on the original choice models
such that they admit $\varepsilon$-approximations by sparse choice models that satisfy the
'signature' conditions and that have sparsity $O(N/\varepsilon^2)$. We establish that for a
very large class of choice models, including very dense models such as the models
from the MNL or exponential family, the observed marginal information can be
well approximated by a sparse choice model from the signature family. We are
then able to use this result to our advantage in designing a novel algorithm for
the recovery of a sparse choice model given noisy first-order marginal information.
In particular, in leveraging this result, our algorithm uses structural properties
of models in the signature family along with an adaptation of the multiplicative
weights update framework of Plotkin-Shmoys-Tardos [44]. Our algorithm finds a
sparse choice model with sparsity $O(K \log N)$ in time $\exp(\Theta(K \log N))$ if there
exists a choice model in the signature family with sparsity K that approximates
the data well; our structural result guarantees the existence of such approximations
for suitable K.

To start with, this is much (exponentially in N) better than the brute force
search which would require $\binom{N!}{K} \approx \exp(\Theta(KN \log N))$ computation. Given that
for a large class of choice models, their marginal data is well approximated by a
signature family choice model with sparsity essentially $O(N)$, the computation cost
is bounded by $\exp(O(N \log N))$, which is $(N!)^{O(1)}$, i.e. polynomial in the dimension,
$N!$, of the ambient data. This is on an equal footing with many of the recently
developed sparse model learning methods under the framework of compressed sensing.
These latter methods are based on convex optimization (typically, linear
programming) and have computational cost that grows polynomially in the dimension of
the ambient data.

We establish the effectiveness of our approach by applying our sparse model 
learning procedure to the well studied American Psychological Association's (APA) 
ranked election data (i.e., the data used by Diaconis in [IS]). Interestingly enough, 
through sparse model approximation of the election data, we find structural
information in the data similar to that unearthed by Diaconis [19]. The basic premise
in [19] was that by looking at linear projections of the ranked election votes, it may
be possible to unearth hidden structure in the data. Our sparse approximation 
captures similar structural information from projected data suggesting the utility 
of this approach in unearthing non-obvious structural information. 

1.3.1. Thematic Relation: Compressed Sensing. We note that this work is themat- 
ically related to the recently developed theory of compressed sensing and streaming 
algorithms, and further to classical coding theory and signal processing; cf. [4l)l 142).
In the compressive sensing literature (see [T31 HU EH HU HO]), the goal is to estimate
a 'signal' by means of a minimal number of measurements. Operationally this is 
equivalent to finding the sparsest signal consistent with the observed (linear) mea- 
surements. In the context of coding theory, this corresponds to finding the most 
likely transmitted codeword given the received message [231 ED US [35] . In the con- 
text of streaming algorithms, this task corresponds to maintaining a 'minimal' data 
structure to implement algorithmic operations [501 HH1 HSJ 121]. In spite of the
thematic similarity, existing approaches to compressive sensing are ill-suited to the
problem at hand; see [30], where the authors establish that the generic Restricted
Null Space condition - a necessary and sufficient condition for the success of the 
convex optimization in finding sparsest possible solution - is not useful in the set- 
ting considered here. In a nutshell, the 'projections' of the signal we observe are a 
given as opposed to being a design choice. Put another way, the present paper can 
be viewed as providing a non-trivial extension to the theory of compressive sensing 
for the problem of efficiently learning distributions over permutations. 

1.4. Organization. The rest of the paper is organized as follows. In Section 2 the
precise problem statement along with the signature condition is introduced. We also
introduce the MNL and exponential family parametric models. The main results
of this paper are stated in Section 3. These results are established in Section 4. In
Section 5 we study an application of our results to the popular benchmark APA
data set. Using a simple heuristic motivated by the signature condition, we learn a
sparse approximation of the observed data. We discuss the relevance of the sparse
approximation thus obtained and conclude that it provides positive support to the
quest of searching for sparse choice models. Finally, we conclude in Section 6 with
a discussion on how the methods we propose for first-order marginal data extend
to general types of marginal data.

2. Setup 

Given N objects or items, we are interested in a choice model or distribution
over permutations of these N items. Let $S_N$ denote the space of $N!$ permutations
of these N items. A choice model (equivalently, a distribution over $S_N$) can then
be represented as a vector of dimension $N!$ with non-negative components, all of
them summing up to 1. The observations we consider here are certain marginal
distributions of the choice model. Specifically, throughout this paper, we primarily
restrict ourselves to first-order marginal information. We point out how our results
extend to general marginal information in the discussion (Section 6).

More precisely, let $\lambda$ denote a choice model or a distribution over $S_N$. Then, the
first-order marginal information, $M(\lambda) = [M_{ij}(\lambda)]$, is an $N \times N$ doubly stochastic
matrix with non-negative entries defined as

$$M_{ij}(\lambda) = \sum_{\sigma \in S_N} \lambda(\sigma)\, \mathbf{1}_{\{\sigma(i) = j\}},$$

where $\sigma \in S_N$ represents a permutation, $\sigma(i)$ denotes the rank of item i under
permutation $\sigma$, and $\mathbf{1}_{\{\cdot\}}$ is the standard indicator with $\mathbf{1}_{\{\text{true}\}} = 1$ and $\mathbf{1}_{\{\text{false}\}} = 0$.
We assume that there is a ground-truth choice model $\lambda$ and the observations
are a noisy version of M. Specifically, let the observations be $D = M + \eta$ so that
$\|\eta\|_2 \le \delta$ for some small enough $\delta > 0$; by $\|\eta\|_2$ we mean

$$\|\eta\|_2^2 = \sum_{i,j=1}^{N} \eta_{ij}^2.$$
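To make the definition of the first-order marginal concrete, the following minimal sketch computes $M(\lambda)$ for a sparse choice model and the $\ell_2$ distance between two matrices. The representation is an assumption made for illustration: a model is a dictionary mapping a permutation (a tuple `sigma` with `sigma[i]` the 0-indexed position of item `i`) to its probability, and the example model is hypothetical.

```python
import math


def first_order_marginal(model, n):
    """Return the n x n matrix M with M[i][j] = sum of model[sigma] over
    all permutations sigma in the support that rank item i at position j."""
    M = [[0.0] * n for _ in range(n)]
    for sigma, prob in model.items():
        for i, j in enumerate(sigma):
            M[i][j] += prob
    return M


def l2_distance(A, B):
    """Entrywise l2 distance: square root of the sum of squared differences."""
    return math.sqrt(
        sum((a - b) ** 2 for row_a, row_b in zip(A, B) for a, b in zip(row_a, row_b))
    )


# Hypothetical sparse model on N = 3 items with support size 3.
model = {(0, 1, 2): 0.5, (1, 0, 2): 0.3, (2, 1, 0): 0.2}
M = first_order_marginal(model, 3)
row_sums = [sum(row) for row in M]
col_sums = [sum(M[i][j] for i in range(3)) for j in range(3)]
```

Both `row_sums` and `col_sums` should equal 1 up to rounding, reflecting that $M(\lambda)$ is doubly stochastic.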

Without loss of generality, we assume that D is also doubly stochastic (or else, it
is possible to transform it into that form). The goal is to learn a choice model or
distribution $\hat{\lambda}$ over $S_N$ so that its first-order marginal $M(\hat{\lambda})$ approximates D (and
hence M) well and the support of $\hat{\lambda}$, $\|\hat{\lambda}\|_0$, is small. Here

$$\|\hat{\lambda}\|_0 = |\{\sigma \in S_N : \hat{\lambda}(\sigma) > 0\}|.$$

Indeed, one way to find such a $\hat{\lambda}$ is to solve the following program: for a choice of
approximation error $\varepsilon > 0$,

(2.1)    minimize $\|\mu\|_0$ over choice models $\mu$
         such that $\|M(\mu) - D\|_2 \le \varepsilon$.

By the Birkhoff-von Neumann theorem and Caratheodory's theorem (as discussed
earlier), there must exist a solution, say $\hat{\lambda}$, to (2.1) with $\|\hat{\lambda}\|_0 \le (N-1)^2 + 1$.
Therefore, a solution to program (2.1) with guarantee $\|\hat{\lambda}\|_0 \le (N-1)^2 + 1$ can be
achieved by a linear programming relaxation of (2.1). The basic question is whether
such a solution is near optimal. Put another way:

Question 1. Given any doubly stochastic matrix D, does there exist a choice model
$\lambda$ with sparsity significantly smaller than $\Theta(N^2)$ such that $\|M(\lambda) - D\|_2 \le \varepsilon$?

Geometrically speaking, the question above translates to: given a ball of radius
$\varepsilon$ around D, is there a subspace spanned by K extreme points that intersects the
ball, for any doubly stochastic matrix D and some K that is significantly smaller
than $\Theta(N^2)$?



Now if the sparsest solution has sparsity K, then the brute-force approach would
require searching over $\binom{N!}{K} \approx \exp(\Theta(KN \log N))$ options. The question here is
whether this could be improved upon significantly. That is,

Question 2. Is it possible to solve (2.1) with a running time complexity that is far
better than $\exp(\Theta(KN \log N))$, at least for a reasonably large class of observations
D?
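To get a feel for the size of the brute-force search space, the following minimal sketch evaluates $\log \binom{N!}{K}$ exactly for small values and compares it with the two exponents $K \log N$ and $KN \log N$. The function name and the choice N = 10, K = 5 are assumptions made purely for illustration.

```python
import math


def log_num_candidate_supports(n, k):
    """Natural log of C(n!, k), the number of size-k subsets of S_n."""
    return math.log(math.comb(math.factorial(n), k))


n, k = 10, 5  # hypothetical problem size
brute_force_exponent = log_num_candidate_supports(n, k)
upper_reference = k * n * math.log(n)  # K N log N
lower_reference = k * math.log(n)      # K log N
```

For these values `brute_force_exponent` lies strictly between `lower_reference` and `upper_reference`, and it grows roughly like $K \log N!$, i.e. on the order of $KN \log N$.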



We obtain a faster algorithm by restricting our search to models that belong
to the signature family that was introduced in earlier work [2H1 EH HU 122]. The
structure of the family allows for efficient search. In addition, we can establish that
the signature family is appropriately "dense", so that by restricting our search to
the signature family, we are not losing much. We now quickly recall the definition
of the signature family:



Signature family. A distribution (choice model) $\lambda$ is said to belong
to the signature family if for each permutation $\sigma$ that is in the
support (i.e., $\lambda(\sigma) > 0$) there exists a pair i, j such that $\sigma(i) = j$
and $\sigma'(i) \ne j$ for any other permutation $\sigma'$ in the support. Equivalently,
for every permutation $\sigma$ in the support of $\lambda$, there exists a pair i, j
such that $\sigma$ ranks i at position j, but no other permutation in the
support ranks i at position j.
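The definition can be checked mechanically for a given support. The following is a minimal sketch under assumed conventions (permutations are tuples with `sigma[i]` the 0-indexed position of item `i`; the function name is hypothetical): it counts how many permutations in the support use each (item, position) pair and then verifies that every permutation owns at least one pair that no other permutation shares.

```python
import itertools


def in_signature_family(support):
    """Return True if every permutation in `support` has some pair
    (i, sigma[i]) that appears in no other permutation of the support."""
    perms = list(set(tuple(sigma) for sigma in support))  # drop duplicates
    counts = {}
    for sigma in perms:
        for pair in enumerate(sigma):
            counts[pair] = counts.get(pair, 0) + 1
    return all(
        any(counts[pair] == 1 for pair in enumerate(sigma)) for sigma in perms
    )


# A small support: the first permutation alone ranks item 0 at position 0,
# and the second alone ranks item 0 at position 1.
sparse_support = [(0, 1, 2), (1, 0, 2)]
# The full symmetric group on 3 items: every (item, position) pair is
# shared by exactly two permutations, so no permutation has a signature.
full_support = list(itertools.permutations(range(3)))

sparse_ok = in_signature_family(sparse_support)
full_ok = in_signature_family(full_support)
```

The check runs in time proportional to N times the support size, which is one reason the structure of the family is convenient to work with.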



The above definition states that each element in the support of $\lambda$ has its
'signature' in the data. Before we describe our answers to the questions above, we
introduce two parametric models that we make use of later.



2.1. Multinomial Logit (MNL) model. Here we describe the version of the
model as introduced by Luce and Plackett [131 [33]. This is a parametric model with
N positive valued parameters, one associated with each of the N items. Let
$w_i > 0$ be the parameter associated with item i. Then the probability of permutation
$\sigma \in S_N$ is given by (for example, see [55])

$$\mathbb{P}_w(\sigma) = \prod_{j=1}^{N} \frac{w_{\sigma^{-1}(j)}}{w_{\sigma^{-1}(j)} + w_{\sigma^{-1}(j+1)} + \cdots + w_{\sigma^{-1}(N)}}. \tag{2.2}$$

Above, $\sigma^{-1}(j) = i$ if $\sigma(i) = j$.
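The product formula for the MNL probability of a permutation can be evaluated directly. The sketch below is illustrative only: the function name and the example weights are assumptions, and permutations are encoded as tuples with `sigma[i]` the 0-indexed position of item `i`. It also enumerates all permutations of three items to confirm that the probabilities sum to one.

```python
import itertools


def mnl_probability(sigma, w):
    """Probability of permutation sigma under the MNL model with weights w.

    sigma[i] is the (0-indexed) position of item i and w[i] > 0 its weight.
    """
    n = len(w)
    inverse = [0] * n  # inverse[pos] = item ranked at position pos
    for item, pos in enumerate(sigma):
        inverse[pos] = item
    prob = 1.0
    remaining = sum(w)  # total weight of items not yet placed
    for pos in range(n):
        weight = w[inverse[pos]]
        prob *= weight / remaining
        remaining -= weight
    return prob


w = [1.0, 2.0, 3.0]  # hypothetical weights
p_identity = mnl_probability((0, 1, 2), w)  # (1/6) * (2/5) * (3/3)
total = sum(mnl_probability(s, w) for s in itertools.permutations(range(3)))
```

Here `p_identity` equals 1/15 and `total` equals 1 up to floating-point rounding.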

2.2. Exponential family model. Now we describe an exponential family of
distributions over permutations. The exponential family is parametrized by $N^2$
parameters $\theta_{ij}$ for $1 \le i, j \le N$. Given such a vector of parameters $\theta$, the probability
of a permutation $\sigma$ is given by

$$\mathbb{P}_\theta(\sigma) \propto \exp\Big(\sum_{1 \le i,j \le N} \theta_{ij}\sigma_{ij}\Big) = \frac{1}{Z(\theta)} \exp\Big(\sum_{1 \le i,j \le N} \theta_{ij}\sigma_{ij}\Big), \tag{2.3}$$

where $Z(\theta) = \sum_{\sigma \in S_N} \exp\big(\sum_{1 \le i,j \le N} \theta_{ij}\sigma_{ij}\big)$; $\sigma_{ij} = 1$ iff $\sigma(i) = j$ and $\sigma_{ij} = 0$
otherwise. It is well known that with respect to the space of all first-order marginal
distributions, the above described exponential family is dense. Specifically, for
any doubly stochastic matrix (the first-order marginals) $M = [M_{ij}]$ with $M_{ij} > 0$
for all i, j, there exists $\theta \in \mathbb{R}^{N \times N}$ so that the first-order marginal induced by the
corresponding exponential family is precisely M. An interested reader is referred to,
for example, the monograph [52] for details on this correspondence between parameters
of the exponential family and its moments.
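For very small N, the exponential family distribution and its first-order marginals can be computed by explicit enumeration, which also makes visible why the partition function $Z(\theta)$ is expensive in general: the sum ranges over all $N!$ permutations. The following is a minimal brute-force sketch; the function names and the example parameter matrix are assumptions for illustration, and this is not a method for fitting $\theta$ to data.

```python
import itertools
import math


def exponential_family_model(theta):
    """Return {sigma: probability} for the exponential family with
    parameters theta[i][j], by enumerating all N! permutations."""
    n = len(theta)
    perms = list(itertools.permutations(range(n)))
    scores = [
        math.exp(sum(theta[i][sigma[i]] for i in range(n))) for sigma in perms
    ]
    z = sum(scores)  # partition function Z(theta)
    return {sigma: score / z for sigma, score in zip(perms, scores)}


def first_order_marginal(model, n):
    """M[i][j] = probability that item i is ranked at position j."""
    M = [[0.0] * n for _ in range(n)]
    for sigma, prob in model.items():
        for i, j in enumerate(sigma):
            M[i][j] += prob
    return M


uniform = exponential_family_model([[0.0] * 3 for _ in range(3)])
# Hypothetical parameters rewarding "item i at position i".
theta = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]
tilted = exponential_family_model(theta)
M = first_order_marginal(tilted, 3)
```

With all parameters zero the model is uniform over the 6 permutations; with the diagonal parameters above, the identity permutation becomes the most likely one and the diagonal entries of `M` exceed 1/3.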

3. Main results 

As our main results, we provide answers to the two questions raised above. We 
provide the answers to each of the questions in turn. 

On Question 1 (Sparse Approximation): As our first result, we establish that given
any doubly stochastic matrix D and $\varepsilon > 0$, there exists a model $\lambda$ with sparsity
$O(N/\varepsilon^2)$ such that $\|M(\lambda) - D\|_2 \le \varepsilon$. Thus, we show that by allowing a "small"
error of $\varepsilon$, one can obtain a significant reduction from $\Theta(N^2)$ to $O(N/\varepsilon^2)$ in the
sparsity of the model that is needed to explain the observations. More precisely,
we have the following theorem.

Theorem 3.1. For any doubly stochastic matrix D and $\varepsilon \in (0,1)$, there exists a
choice model $\lambda$ such that $\|\lambda\|_0 = O(N/\varepsilon^2)$ and $\|M(\lambda) - D\|_2 \le \varepsilon$.

We emphasize here that this result holds for any doubly stochastic matrix D. In
such generality, this result is in fact tight in terms of the dependence on N of the
required sparsity. To see that, consider the uniform doubly stochastic matrix D
with all of its entries equal to $1/N$. Then, the marginal matrix $M(\lambda)$ of any choice
model $\lambda$ with $o(N)$ support
can have at most $N \times o(N) = o(N^2)$ non-zero entries, which in turn means that
the $\ell_2$ error $\|M(\lambda) - D\|_2$ is at least $\sqrt{(N^2 - o(N^2))/N^2} \approx 1$ for large N.
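One way to build intuition for the $O(N/\varepsilon^2)$ scaling is a sampling argument, sketched below; this is offered as an illustration and is not claimed to be the argument used in the proof of Theorem 3.1. If K permutations are drawn independently from any choice model with marginal matrix M, the empirical average of their permutation matrices has expected squared $\ell_2$ error $\sum_{i,j} M_{ij}(1 - M_{ij})/K \le N/K$, so taking K of order $N/\varepsilon^2$ suffices on average. The code draws from the uniform distribution over permutations (a dense model whose marginals all equal $1/N$); the names, the sizes N = 6 and K = 600, and the seed are assumptions.

```python
import random


def empirical_sparse_marginal(sample_permutation, n, k, rng):
    """Average the permutation matrices of k sampled permutations.

    The result is the first-order marginal of a choice model whose
    support has at most k permutations.
    """
    M_hat = [[0.0] * n for _ in range(n)]
    for _ in range(k):
        sigma = sample_permutation(rng)
        for i, j in enumerate(sigma):
            M_hat[i][j] += 1.0 / k
    return M_hat


def squared_l2_error(A, B):
    return sum((a - b) ** 2 for ra, rb in zip(A, B) for a, b in zip(ra, rb))


n, k = 6, 600  # hypothetical sizes
rng = random.Random(0)


def uniform_sampler(r):
    # A uniformly random permutation of the n items.
    return r.sample(range(n), n)


D = [[1.0 / n] * n for _ in range(n)]  # marginals of the uniform model
M_hat = empirical_sparse_marginal(uniform_sampler, n, k, rng)
err2 = squared_l2_error(M_hat, D)
bound = n / k  # bound on the expected squared error
```

The observed `err2` fluctuates around $(N-1)/K$ for this example; the bound holds in expectation, so a single run is only indicative.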

The result of Theorem 3.1 also justifies why convex relaxations don't have any
bite in our setting. Specifically, suppose we are given a doubly stochastic matrix D
and a tolerance parameter $\varepsilon > 0$. Then, all the consistent choice models $\lambda$, which
satisfy $\|M(\lambda) - D\|_2 \le \varepsilon$, have the same $\ell_1$ norm (each is a probability
distribution, so its $\ell_1$ norm equals 1). We claim that "most" of such
consistent models $\lambda$ have sparsity $\Theta(N^2)$. More precisely, following the arguments
presented in the proof of [22, Theorem 1], we can show that the set of doubly
stochastic matrices $\tilde{D}$ such that $\|\tilde{D} - D\|_2 \le \varepsilon$ and that can be written as $M(\lambda) = \tilde{D}$ for
some model $\lambda$ with sparsity $K < (N-1)^2$ has an $(N-1)^2$-dimensional volume of
zero. It thus follows that picking an arbitrary consistent model $\lambda$ will most certainly
yield a model with sparsity $\Theta(N^2)$; this is a factor N off from the sparsest solution,
which has a sparsity of $O(N)$ (ignoring the $\varepsilon$ dependence).

On Question 2 (Efficient Algorithms): We now consider the question of efficiently
solving the program in (2.1). As explained above, a brute-force search for a model of
sparsity K that is consistent with the data requires searching over $\exp(\Theta(KN \log N))$
options. We now show that by restricting ourselves to a reasonably large class of
choice models, we can improve the running time complexity to $O(\exp(\Theta(K \log N)))$
- effectively eliminating a factor of N from the exponent. More precisely, we can
establish the following result.

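Theorem 3.2. Given a noisy observation D and $\varepsilon \in (0, 1/2)$, suppose there exists
a choice model $\lambda$ in the signature family such that $\|\lambda\|_0 = K$ and $\|D - M(\lambda)\|_2 \le \varepsilon$.
Then, with a running time complexity of $\exp(\Theta(K \log N))$, we can find a choice
model $\hat{\lambda}$ such that

$$\|\hat{\lambda}\|_0 = O(\varepsilon^{-2} K \log N) \quad \text{and} \quad \|M(\hat{\lambda}) - D\|_2 \le 2\varepsilon.$$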



Several remarks are in order. The proof of Theorem 3.2 is constructive in the sense that it proposes an algorithm to find a sparse model with the stated guarantees. The result of Theorem 3.2 essentially establishes that as long as there is a sparse choice model of sparsity $K$ in the signature family that is an $\varepsilon$-fit to the observations $D$, we can shave off a factor of $N$ in the exponent from the running time complexity at the cost of finding a model with sparsity that is essentially within a factor of $\log N$ of $K$. In other words, we can obtain an exponential reduction in the running time complexity at the cost of introducing a factor of $\log N$ in the sparsity.

It is worth pausing here to understand how good (or bad) the computation cost of $\exp(\Theta(K \log N))$ is. As discussed below (in Theorem 3.3), for a large class of choice models, the sparsity $K$ scales as $O(\varepsilon^{-2} N)$, which implies that the computation cost scales as $\exp(\Theta(N \log N))$ (ignoring $\varepsilon$ to focus on dependence on $N$). That is, the computational cost is polynomial in $N! = \exp(\Theta(N \log N))$, or equivalently, polynomial in the dimension of the ambient space. To put this in perspective, the scaling we obtain is very similar to the scaling obtained in the recently popular compressive sensing literature, where sparse models are recovered by solving linear or convex programs, which result in a computational complexity that is polynomial in the ambient dimension.
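To make the gap between the two search spaces concrete, the following sketch compares the logarithm of the number of brute-force options (choosing $K$ of the $N!$ permutations) with the logarithm of the number of candidate signature sets (choosing $K$ of the $N^2$ index pairs). The function names are illustrative and not taken from the paper; the counts are the binomial coefficients implied by the discussion above.

```python
import math

def log_brute_force_options(n: int, k: int) -> float:
    """ln C(n!, k): ways to pick k distinct permutations as a support."""
    return math.log(math.comb(math.factorial(n), k))

def log_signature_options(n: int, k: int) -> float:
    """ln C(n^2, k): ways to pick k candidate signature components."""
    return math.log(math.comb(n * n, k))

if __name__ == "__main__":
    for n, k in [(5, 3), (10, 5), (20, 5)]:
        print(n, k,
              round(log_brute_force_options(n, k), 1),
              round(log_signature_options(n, k), 1))
```

The second quantity is at most $2K \ln N$, which is the $\exp(\Theta(K \log N))$ scaling discussed above, while the first grows like $K N \ln N$.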

Finally, the guarantee of Theorem 3.2 is conditional on the existence of a sparse choice model in the signature family that is an $\varepsilon$-fit to the data. It is natural to wonder if such a requirement is restrictive. Specifically, given any doubly stochastic matrix $D$, there are two possibilities. Firstly, it may be the case that there is no model in the signature family that is an $\varepsilon$-fit to the data; in such a case, we may have to lose precision by increasing $\varepsilon$ in order to find a model in the signature family. Secondly, even if there did exist such a model, it may not be "sparse enough"; in other words, we may end up with a solution in the signature family whose sparsity scales like $\Theta(N^2)$. Our next result shows that both scenarios described above do not happen; essentially, it establishes that the signature family of models is "dense" enough so that for a "large" class of data vectors, we can find a "sparse enough" model in the signature family that is an $\varepsilon$-fit to the data. More specifically, we can establish that the signature family is "dense" as long as the observations are generated by an MNL model or an exponential family model.

Theorem 3.3. Suppose $D$ is a noisy observation of the first-order marginal $M(\lambda)$ with $\|D - M(\lambda)\|_2 \le \varepsilon$ for some $\varepsilon \in (0, 1/2)$ and choice model $\lambda$ such that

1. either, $\lambda$ is an MNL model with parameters $w_1, \dots, w_N$ (and without loss of generality $w_1 < w_2 < \dots < w_N$) such that

$$\frac{w_N}{\sum_{k=1}^{N-L+1} w_k} \le \frac{\sqrt{\log N}}{N} \tag{3.1}$$

for $L = N^{\delta}$ for some $\delta \in (0, 1)$;

2. or, $\lambda$ is an exponential family model with parameters $\theta$ such that for any set of four distinct tuples of integers $(i_1, j_1)$, $(i_2, j_2)$, $(i_3, j_3)$, and $(i_4, j_4)$ (with $1 \le i_k, j_k \le N$ for $1 \le k \le 4$)

$$\frac{\exp(\theta_{i_1 j_1} + \theta_{i_2 j_2})}{\exp(\theta_{i_3 j_3} + \theta_{i_4 j_4})} \le \sqrt{\log N}. \tag{3.2}$$

Then, there exists a $\hat{\lambda}$ in the signature family such that: $\|D - M(\hat{\lambda})\|_2 \le 2\varepsilon$ and

$$\|\hat{\lambda}\|_0 = O(N/\varepsilon^2).$$



Remark. The conditions (3.1) and (3.2) can be further relaxed by replacing $\sqrt{\log N}$ (in both of them) by $C \log N/\varepsilon^2$ for an appropriately chosen (small enough) constant $C > 0$. For the clarity of the exposition, we have chosen a somewhat weaker condition.
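As an illustration of how condition (3.1) might be checked numerically, the sketch below evaluates the ratio $W(L, N) = w_N / \sum_{k=1}^{N-L+1} w_k$ that appears in the proof in Section 4.2 and compares it with $\sqrt{\log N}/N$. The function name and the rounding of $L = N^{\delta}$ to an integer are assumptions made here for illustration only.

```python
import math

def satisfies_mnl_condition(weights, delta):
    """Check w_N / sum_{k=1}^{N-L+1} w_k <= sqrt(log N) / N with L = N^delta.

    Weights are sorted increasingly first; L is rounded to the nearest
    integer (an assumption of this sketch, the text only states L = N^delta).
    """
    w = sorted(weights)
    n = len(w)
    L = max(1, int(round(n ** delta)))
    denominator = sum(w[: n - L + 1])  # the N - L + 1 smallest weights
    return w[-1] / denominator <= math.sqrt(math.log(n)) / n

if __name__ == "__main__":
    print(satisfies_mnl_condition([1.0] * 100, 0.5))             # near-uniform weights
    print(satisfies_mnl_condition([1.0] * 99 + [1000.0], 0.5))   # one dominant item
```

Intuitively, the condition fails when a single item carries so much weight that sampled permutations keep placing it in the same position.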



We have established in Theorem 3.3 that (under appropriate conditions) the rich families of MNL and exponential models can be approximated by sparse models in signature families as far as first-order marginals are concerned. Note that both families induce distributions that have full support. Thus, if the only thing we care about is the first-order marginals, then we can just use sparse models in the signature family with sparsity only $O(N)$ (ignoring $\varepsilon$ dependence) rather than distributions that have full support. It is also interesting to note that in Theorem 3.1 we establish the existence of a sparse model of sparsity $O(N/\varepsilon^2)$ that is an $\varepsilon$-fit to the observations. The result of Theorem 3.3 establishes that by restricting to the signature family, the sparsity scaling is still $O(N/\varepsilon^2)$, implying that we are not losing much in terms of sparsity by the restriction to the signature family.

In the next section we present the proofs of Theorems 3.1-3.3 before we present the results of our empirical study.

4. Proofs 



4.1. Proof of Theorem 3.1. We prove this theorem using the probabilistic method. Given the doubly stochastic matrix $D$, there exists a choice model (by Birkhoff-von Neumann's result) $\lambda$ such that $M(\lambda) = D$. Suppose we draw $T$ permutations (samples) independently according to the distribution $\lambda$. Let $\hat{\lambda}$ denote the empirical distribution based on these $T$ samples. We show that for $T = N/\varepsilon^2$, on average $\|M(\hat{\lambda}) - D\|_2 \le \varepsilon$. Therefore, there must exist a choice model with $T = N/\varepsilon^2$ support size whose first-order marginals approximate $D$ within an $\ell_2$ error of $\varepsilon$.

To that end, let $\sigma_1, \sigma_2, \dots, \sigma_T$ denote the $T$ samples of permutations and $\hat{\lambda}$ be the empirical distribution (or choice model) that puts $1/T$ probability mass over each of the sampled permutations. Now consider a pair of indices $1 \le i, j \le N$. Let $X^t_{ij}$ denote the indicator variable of the event that $\sigma_t(i) = j$. Since the permutations are drawn independently and in an identically distributed manner, $X^t_{ij}$ are independent and identically distributed (i.i.d.) Bernoulli variables for $1 \le t \le T$. Further,

$$P(X^t_{ij} = 1) = E[X^t_{ij}] = D_{ij}.$$

Therefore, the $(i, j)$ component of the first-order marginal $M(\hat{\lambda})$ of $\hat{\lambda}$ is the empirical mean of a Binomial random variable with parameters $T$ and $D_{ij}$, denoted by $B(T, D_{ij})$. Therefore, with respect to the randomness of sampling,



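$$E\left[\left(\frac{1}{T}\sum_{t=1}^{T} X^t_{ij} - D_{ij}\right)^2\right] = \frac{1}{T^2}\,\mathrm{Var}\big(B(T, D_{ij})\big) = \frac{1}{T^2}\, T D_{ij}(1 - D_{ij}) \le \frac{D_{ij}}{T}, \tag{4.1}$$

where we used the fact that $D_{ij} \in [0, 1]$ for all $1 \le i, j \le N$. Therefore,

$$E\left[\|M(\hat{\lambda}) - D\|_2^2\right] = E\left[\sum_{i,j}\left(\frac{1}{T}\sum_{t=1}^{T} X^t_{ij} - D_{ij}\right)^2\right] \le \sum_{i,j} \frac{D_{ij}}{T} = \frac{N}{T}, \tag{4.2}$$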



where the last equality follows from the fact that $D$ is a doubly stochastic matrix and hence its entries sum up to $N$. From (4.2), it follows that by selecting $T = N/\varepsilon^2$, the error in approximating the first-order marginals, $\|M(\hat{\lambda}) - D\|_2$, is within $\varepsilon$ on average. Therefore, the existence of such a choice model follows by the probabilistic method. This completes the proof of Theorem 3.1.
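The sampling argument can be checked numerically. The sketch below is a minimal simulation, assuming the simplest possible $\lambda$ (the uniform distribution over all permutations, so every entry of $D$ equals $1/N$); the helper names are illustrative. It draws $T$ permutations, forms the empirical marginal, and averages the squared $\ell_2$ error over repeated trials, which should stay below the bound $N/T$ from (4.2).

```python
import random

def empirical_marginal(perms, n):
    """First-order marginal of the distribution putting mass 1/T on each sample."""
    m = [[0.0] * n for _ in range(n)]
    for sigma in perms:
        for i, j in enumerate(sigma):  # sigma maps i to j
            m[i][j] += 1.0 / len(perms)
    return m

def mean_squared_error(n, t, trials, seed=0):
    """Average of ||M(lambda_hat) - D||_2^2 when lambda is uniform over S_n."""
    rng = random.Random(seed)
    d = 1.0 / n  # every entry of D under the uniform model
    total = 0.0
    for _ in range(trials):
        perms = []
        for _ in range(t):
            sigma = list(range(n))
            rng.shuffle(sigma)
            perms.append(sigma)
        m = empirical_marginal(perms, n)
        total += sum((m[i][j] - d) ** 2 for i in range(n) for j in range(n))
    return total / trials

if __name__ == "__main__":
    n, t = 4, 100
    print(mean_squared_error(n, t, trials=200), "bound:", n / t)
```

For the uniform model the exact expectation is $\sum_{i,j} D_{ij}(1 - D_{ij})/T = (N-1)/T$, slightly below the bound $N/T$.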



4.2. Proof of Theorem 3.3. We prove Theorem 3.3 using the probabilistic method as well. As before, suppose that we observe $D$, which is a noisy version of the first-order marginal $M(\lambda)$ of the underlying choice model $\lambda$. As per the hypothesis of Theorem 3.3, we shall assume that $\lambda$ satisfies one of the two conditions: either it is from the MNL model or from the exponential family with a regularity condition on its parameters. For such $\lambda$, we establish the existence of a sparse choice model $\hat{\lambda}$ that satisfies the signature conditions and approximates $M(\lambda)$ (and hence approximates $D$) well.

As in the proof of Theorem 3.1, consider $T$ permutations drawn independently and in an identical manner from the distribution $\lambda$. Let $\hat{\lambda}$ be the empirical distribution of these $T$ samples as considered before. Following the arguments there, we obtain (like (4.2)) that

$$E\left[\|M(\hat{\lambda}) - M(\lambda)\|_2^2\right] \le \frac{N}{T}. \tag{4.3}$$

For the choice of $T = 4N/\varepsilon^2$, using Markov's inequality, we can write

$$P\left(\|M(\hat{\lambda}) - M(\lambda)\|_2^2 \ge \varepsilon^2\right) \le \frac{1}{4}. \tag{4.4}$$

Since $\|M(\lambda) - D\|_2 \le \varepsilon$, it follows that $\|M(\hat{\lambda}) - D\|_2 \le 2\varepsilon$ with probability at least 3/4.
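For completeness, the step from the second-moment bound to the $2\varepsilon$ guarantee can be written out as follows. This is a short worked derivation using only Markov's inequality and the triangle inequality, with $\hat{\lambda}$ denoting the empirical model and $T = 4N/\varepsilon^2$:

```latex
\begin{align*}
E\big[\|M(\hat\lambda) - M(\lambda)\|_2^2\big] &\le \frac{N}{T} = \frac{\varepsilon^2}{4}, \\
P\big(\|M(\hat\lambda) - M(\lambda)\|_2^2 \ge \varepsilon^2\big)
  &\le \frac{E\big[\|M(\hat\lambda) - M(\lambda)\|_2^2\big]}{\varepsilon^2} \le \frac{1}{4}, \\
\|M(\hat\lambda) - D\|_2
  &\le \|M(\hat\lambda) - M(\lambda)\|_2 + \|M(\lambda) - D\|_2
  \le \varepsilon + \varepsilon = 2\varepsilon
  \quad \text{with probability at least } \tfrac{3}{4}.
\end{align*}
```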

Next, we show that the $\hat{\lambda}$ thus generated satisfies the signature condition with a high probability (at least 1/2) as well. Therefore, by the union bound we can conclude that $\hat{\lambda}$ satisfies the properties claimed by Theorem 3.3 with probability at least 1/4.

To that end, let $E_t$ be the event that $\sigma_t$ satisfies the signature condition with respect to the set $(\sigma_1, \dots, \sigma_T)$. Since all $\sigma_1, \dots, \sigma_T$ are chosen in an i.i.d. manner, the probabilities of all these events are identical. We wish to show that $P\big(\cup_{1 \le t \le T} E_t^c\big) \le 1/2$. This will follow from establishing $T\,P(E_1^c) \le 1/2$. To establish this, it is sufficient to show that $P(E_1^c) \le 1/N^2$ because $T = 4N/\varepsilon^2$.

To that end, suppose $\sigma_1$ is such that $\sigma_1(1) = i_1, \dots, \sigma_1(N) = i_N$. Let $F_j = \{\sigma_t(j) \ne i_j \text{ for all } 2 \le t \le T\}$. Then by the definition of the signature condition, it follows that $E_1 = \cup_{j=1}^{N} F_j$. Therefore, for any $L \le N$ (below we take $L = N^{\delta}$),

$$P(E_1^c) = P\Big(\bigcap_{j=1}^{N} F_j^c\Big) \le P\Big(\bigcap_{j=1}^{L} F_j^c\Big) = P(F_1^c) \prod_{j=2}^{L} P\Big(F_j^c \,\Big|\, \bigcap_{\ell=1}^{j-1} F_\ell^c\Big). \tag{4.5}$$



We will establish that the right hand side of (4.5) is bounded above by $O(1/N^2)$ and hence $T\,P(E_1^c) = O(\varepsilon^{-2}/N) \le 1/2$ for $N$ large enough as desired. To establish this bound of $O(1/N^2)$ under the two different conditions stated in Theorem 3.3, we consider in turn the two cases: (i) $\lambda$ belongs to the MNL family with condition (3.1) satisfied, and (ii) $\lambda$ belongs to the max-ent exponential family model with the condition (3.2) satisfied.



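Bounding (4.5) under MNL model with (3.1). Let $L = N^{\delta}$ for some $\delta > 0$ as in the hypothesis of Theorem 3.3 under which (3.1) holds. Now

$$P(F_1^c) = 1 - P(F_1) = 1 - P\big(\sigma_t(1) \ne i_1;\ 2 \le t \le T\big) = 1 - P\big(\sigma_2(1) \ne i_1\big)^{T-1} = 1 - \left(1 - \frac{w_{i_1}}{\sum_{k=1}^{N} w_k}\right)^{T-1}. \tag{4.6}$$

For $j \ge 2$, in order to evaluate $P\big(F_j^c \mid \cap_{\ell=1}^{j-1} F_\ell^c\big)$, we shall evaluate $1 - P\big(F_j \mid \cap_{\ell=1}^{j-1} F_\ell^c\big)$.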

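To evaluate $P\big(F_j \mid \cap_{\ell=1}^{j-1} F_\ell^c\big)$, note that the conditioning event $\cap_{\ell=1}^{j-1} F_\ell^c$ suggests that for each $\sigma_t$, $2 \le t \le T$, some assignments (ranks) for the first $j - 1$ items are given and we need to find the probability that the $j$th item of each of the $\sigma_2, \dots, \sigma_T$ is not mapped to $i_j$. Therefore, given $\cap_{\ell=1}^{j-1} F_\ell^c$, the probability that $\sigma_2(j)$ does map to $i_j$ is $w_{i_j}/\big(\sum_{k \in X} w_k\big)$, where $X$ is the set of $N - j + 1$ elements that does not include the $j - 1$ elements to which $\sigma_2(1), \dots, \sigma_2(j-1)$ are mapped. Since by assumption (without loss of generality) $w_1 < \dots < w_N$, it follows that $\sum_{k \in X} w_k \ge \sum_{k=1}^{N-j+1} w_k$. Therefore,

$$P\Big(F_j \,\Big|\, \bigcap_{\ell=1}^{j-1} F_\ell^c\Big) \ge \left(1 - \frac{w_{i_j}}{\sum_{k=1}^{N-j+1} w_k}\right)^{T-1}. \tag{4.7}$$

Therefore, it follows that

$$P(E_1^c) \le \prod_{j=1}^{L}\left[1 - \left(1 - \frac{w_{i_j}}{\sum_{k=1}^{N-j+1} w_k}\right)^{T-1}\right] \le \prod_{j=1}^{L}\left[1 - \left(1 - \frac{w_N}{\sum_{k=1}^{N-j+1} w_k}\right)^{T-1}\right] \le \prod_{j=1}^{L}\left[1 - \left(1 - \frac{w_N}{\sum_{k=1}^{N-L+1} w_k}\right)^{T-1}\right] = \left[1 - \left(1 - \frac{w_N}{\sum_{k=1}^{N-L+1} w_k}\right)^{T-1}\right]^{L}. \tag{4.8}$$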



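Let $W(L, N) = w_N / \big(\sum_{k=1}^{N-L+1} w_k\big)$. By the hypothesis of Theorem 3.3, it follows that $W(L, N) \le \sqrt{\log N}/N$ and $L = N^{\delta}$. Therefore, from above it follows that

$$P(E_1^c) \le \left[1 - \left(1 - \frac{\sqrt{\log N}}{N}\right)^{T-1}\right]^{L} \le \left[1 - \Theta\left(\exp\left(-\frac{T\sqrt{\log N}}{N}\right)\right)\right]^{L}, \tag{4.9}$$

where we have used the fact that $1 - x = \exp(-x)(1 + O(x^2))$ for $x \in [0, 1]$ (with $x = \sqrt{\log N}/N$) and, since $T = 4N/\varepsilon^2$, $(1 + O(\log N/N^2))^T = 1 + o(1) = \Theta(1)$. Now

$$\exp\left(-\frac{T\sqrt{\log N}}{N}\right) = \exp\left(-4\sqrt{\log N}/\varepsilon^2\right) < 1. \tag{4.10}$$

Therefore, using the inequality $1 - x \le \exp(-x)$ for $x \in [0, 1]$, we have

$$P(E_1^c) \le \exp\left(-\Theta\left(L \exp\left(-4\sqrt{\log N}/\varepsilon^2\right)\right)\right). \tag{4.11}$$

Since $L = N^{\delta}$ for some $\delta > 0$ and $\exp\left(4\sqrt{\log N}/\varepsilon^2\right) = o(N^{\delta/2})$ for any $\delta > 0$, it follows that

$$P(E_1^c) \le \exp\left(-\Theta(N^{\delta/2})\right) \le O(1/N^2). \tag{4.12}$$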



Therefore, it follows that all the $T$ samples satisfy the signature condition with respect to each other except with probability at most $O(1/N) \le 1/4$ for $N$ large enough. Therefore, we have established the existence of the desired sparse choice model in the signature family. This completes the proof of Theorem 3.3 under the MNL model with condition (3.1).



Bounding (4.5) under exponential family model with (3.2). As before, let $L = N^{\delta}$ for some $\delta > 0$ (the choice of $\delta > 0$ here is arbitrary; for simplicity, we shall think of this $\delta$ as being the same as that used above). Now

$$P(F_1^c) = 1 - P(F_1) = 1 - P\big(\sigma_t(1) \ne i_1;\ 2 \le t \le T\big) = 1 - P\big(\sigma_2(1) \ne i_1\big)^{T-1}. \tag{4.13}$$



To bound the right hand side of (4.13), we need to carefully understand the implication of (3.2) on the exponential family distribution. To start with, suppose the parameters $\theta_{ij}$ are equal for all $1 \le i, j \le N$. In that case, it is easy to see that all permutations have equal ($1/N!$) probability assigned and hence the probability $P(\sigma_2(1) \ne i_1)$ equals $1 - 1/N$. However, in general such an evaluation (or bounding) is not straightforward as the form of the exponential family involves the 'partition' function. To that end, consider $1 \le i \ne i' \le N$. Now by the definition of the exponential family (and $\sigma_2$ is chosen as per it),

$$P\big(\sigma_2(1) = i\big) = \frac{1}{Z(\theta)} \sum_{\sigma \in S_N(1 \to i)} \exp\Big(\sum_{k,l} \theta_{kl}\,\sigma_{kl}\Big) = \frac{\exp(\theta_{1i})}{Z(\theta)} \sum_{\sigma \in S_N(1 \to i)} \exp\Big(\sum_{k \ne 1,\, l} \theta_{kl}\,\sigma_{kl}\Big). \tag{4.14}$$



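In the above, $S_N(1 \to i)$ denotes the set of all permutations in $S_N$ that map 1 to $i$:

$$S_N(1 \to i) = \{\sigma \in S_N : \sigma(1) = i\}.$$

Given this, it follows that

$$\frac{P\big(\sigma_2(1) = i\big)}{P\big(\sigma_2(1) = i'\big)} = \frac{\exp(\theta_{1i}) \sum_{\sigma \in S_N(1 \to i)} \exp\big(\sum_{k \ne 1,\, l} \theta_{kl}\,\sigma_{kl}\big)}{\exp(\theta_{1i'}) \sum_{\rho \in S_N(1 \to i')} \exp\big(\sum_{k \ne 1,\, l} \theta_{kl}\,\rho_{kl}\big)}. \tag{4.15}$$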



Next, we will consider a one-to-one and onto map from $S_N(1 \to i)$ to $S_N(1 \to i')$ (which are of the same cardinality). Under this mapping, suppose $\sigma \in S_N(1 \to i)$ is mapped to $\rho \in S_N(1 \to i')$. Then we shall have that

$$\exp\Big(\sum_{k,l} \sigma_{kl}\,\theta_{kl}\Big) \le \sqrt{\log N}\, \exp\Big(\sum_{k,l} \rho_{kl}\,\theta_{kl}\Big). \tag{4.16}$$

This, along with (4.15), will imply that

$$\frac{P\big(\sigma_2(1) = i\big)}{P\big(\sigma_2(1) = i'\big)} \le \sqrt{\log N}. \tag{4.17}$$

This in turn implies that for any $i$, $P(\sigma_2(1) = i) \le \sqrt{\log N}/N$, which we shall use in bounding (4.13).

To that end, we consider the following mapping from $S_N(1 \to i)$ to $S_N(1 \to i')$. Consider a $\sigma \in S_N(1 \to i)$. By definition $\sigma(1) = i$. Let $q$ be such that $\sigma(q) = i'$. Then map $\sigma$ to $\rho \in S_N(1 \to i')$ where $\rho(1) = i'$, $\rho(q) = i$ and $\rho(k) = \sigma(k)$ for $k \ne 1, q$. Then,

$$\frac{\exp\big(\sum_{k,l} \sigma_{kl}\,\theta_{kl}\big)}{\exp\big(\sum_{k,l} \rho_{kl}\,\theta_{kl}\big)} = \frac{\exp(\theta_{1i} + \theta_{qi'})}{\exp(\theta_{1i'} + \theta_{qi})} \le \sqrt{\log N}, \tag{4.18}$$

where the last inequality follows from condition (3.2) in the statement of Theorem 3.3. From the above discussion, we conclude that

$$P(F_1^c) = 1 - P\big(\sigma_2(1) \ne i_1\big)^{T-1} \le 1 - \left(1 - \frac{\sqrt{\log N}}{N}\right)^{T-1}. \tag{4.19}$$

For $j \ge 2$, in order to evaluate $P\big(F_j^c \mid \cap_{\ell=1}^{j-1} F_\ell^c\big)$, we evaluate $1 - P\big(F_j \mid \cap_{\ell=1}^{j-1} F_\ell^c\big)$.

To evaluate $P\big(F_j \mid \cap_{\ell=1}^{j-1} F_\ell^c\big)$, note that the conditioning event $\cap_{\ell=1}^{j-1} F_\ell^c$ suggests that for each $\sigma_t$, $2 \le t \le T$, some assignments (ranks) for the first $j - 1$ items are given and we need to find the probability that the $j$th item of each of the $\sigma_2, \dots, \sigma_T$ is not mapped to $i_j$. Therefore, given $\cap_{\ell=1}^{j-1} F_\ell^c$, we wish to evaluate (an upper bound on) the probability of $\sigma_2(j)$ mapping to $i_j$ given that we know the assignments of $\sigma_2(1), \dots, \sigma_2(j-1)$. By the form of the exponential family, conditioning on the assignments $\sigma_2(1), \dots, \sigma_2(j-1)$, effectively we have an exponential family on the space of permutations of the remaining $N - j + 1$ elements. And with respect to that, we wish to bound the marginal probability of $\sigma_2(j)$ mapping to $i_j$. By an argument identical to the one used above to show that $P(\sigma_2(1) = i) \le \sqrt{\log N}/N$, it follows that

$$P\Big(\sigma_2(j) = i_j \,\Big|\, \bigcap_{\ell=1}^{j-1} F_\ell^c\Big) \le \frac{\sqrt{\log N}}{N - j + 1} \le \frac{2\sqrt{\log N}}{N}, \tag{4.20}$$

where we have used the fact that $j \le L = N^{\delta} \le N/2$ (for $N$ large enough). Therefore, it follows that

$$P\Big(F_j^c \,\Big|\, \bigcap_{\ell=1}^{j-1} F_\ell^c\Big) \le 1 - \left(1 - \frac{2\sqrt{\log N}}{N}\right)^{T-1}. \tag{4.21}$$

From here on, using arguments identical to those used above (under the MNL model), we conclude that

$$P(E_1^c) \le \exp\left(-\Theta(N^{\delta/2})\right) \le O(1/N^2). \tag{4.22}$$

This completes the proof for the max-ent exponential family with condition (3.2) and hence that of Theorem 3.3.



4.3. Proof of Theorem 3.2. We are given a doubly-stochastic observation matrix $D$. Suppose there exists a choice model $\mu$ such that it satisfies the signature condition, $\|\mu\|_0 = K$ and $\|M(\mu) - D\|_2 \le \varepsilon$. Then, the algorithm we describe below finds a choice model $\hat{\lambda}$ such that $\|\hat{\lambda}\|_0 = O(\varepsilon^{-2} K \log N)$, $\|M(\hat{\lambda}) - D\|_\infty \le 2\varepsilon$ in time $\exp(\Theta(K \log N))$. This algorithm requires effectively searching over the space of choice models from the signature family. Before we can describe the algorithm, we introduce






a representation of the models in the signature family, which allows us to reduce the problem to solving a collection of linear programs (LPs).

Representation of signature family. We start by developing a representation of choice models from the signature family that is based on their first-order marginal information. All the relevant variables are represented by vectors in $N^2$ dimensions. For example, the data matrix $D = [D_{ij}]$ is represented as an $N^2$ dimensional vector with components indexed by tuples for the ease of exposition: $D_{ij}$ will be denoted as $D_{(i,j)}$ and the dimensions will be ordered as per the lexicographic ordering of the tuples, i.e. $(i, j) < (i', j')$ iff $i < i'$ or $i = i'$ and $j < j'$. Therefore, $D$ in column vector form is

$$D = \big[D_{(1,1)}\ D_{(1,2)}\ \cdots\ D_{(1,N)}\ D_{(2,1)}\ \cdots\ D_{(N,N)}\big]^T.$$

In a similar manner, we represent a permutation $\sigma \in S_N$ as a 0-1 valued $N^2$ dimensional vector $\sigma = [\sigma_{(i,j)}]$ with $\sigma_{(i,j)} = 1$ if $\sigma(i) = j$ and 0 otherwise.

Now consider a choice model in the signature family with support $K$. Suppose it has the support $\sigma^1, \dots, \sigma^K$ with their respective probabilities $p_1, \dots, p_K$. Since the model belongs to the signature family, the $K$ permutations have distinct signature components. Specifically, for each $k$, let $(i_k, j_k)$ be the signature component of permutation $\sigma^k$ so that $\sigma^k(i_k) = j_k$ (i.e. $\sigma^k_{(i_k, j_k)} = 1$) but $\sigma^{k'}(i_k) \ne j_k$ (i.e. $\sigma^{k'}_{(i_k, j_k)} = 0$) for all $k' \ne k$, $1 \le k' \le K$. Now let $M = [M_{(i,j)}]$ be the first-order marginals of this choice model. Then, it is clear from our notation that $M_{(i_k, j_k)} = p_k$ for $1 \le k \le K$ and, for any other $(i, j)$ with $1 \le i, j \le N$, $M_{(i,j)}$ is a summation of a subset of the $K$ values $p_1, \dots, p_K$.
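The following sketch illustrates the signature property on a toy example. Permutations are represented 0-indexed as tuples with `sigma[i] = j` meaning $\sigma(i) = j$; the function names are illustrative and not part of the paper. It finds, for each permutation in the support, a pair $(i, j)$ that no other permutation in the support shares, and checks that the first-order marginal at that pair equals the permutation's probability.

```python
def signature_components(perms):
    """For each permutation, return one pair (i, j) with sigma(i) = j that is
    not shared by any other permutation in the list, or None if none exists."""
    result = []
    for k, sigma in enumerate(perms):
        found = None
        for i, j in enumerate(sigma):
            if all(other[i] != j for k2, other in enumerate(perms) if k2 != k):
                found = (i, j)
                break
        result.append(found)
    return result

def first_order_marginal(perms, probs):
    """M[i][j] = total probability of permutations mapping i to j."""
    n = len(perms[0])
    m = [[0.0] * n for _ in range(n)]
    for sigma, p in zip(perms, probs):
        for i, j in enumerate(sigma):
            m[i][j] += p
    return m

if __name__ == "__main__":
    perms = [(0, 1, 2), (1, 0, 2), (2, 1, 0)]
    probs = [0.5, 0.3, 0.2]
    sig = signature_components(perms)
    M = first_order_marginal(perms, probs)
    for (i, j), p in zip(sig, probs):
        print((i, j), M[i][j], p)  # marginal at the signature pair equals p
```

A support in which some permutation returns `None` (for example, a repeated permutation) does not satisfy the signature condition.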

The above discussion leads to the following representation of a choice model from the signature family. Each choice model is represented by an $N^2 \times N^2$ matrix with 0-1 entries, say $Z = [Z_{(i,j)(i',j')}]$ for $1 \le i, j, i', j' \le N$: in $Z_{(i,j)(i',j')}$, $(i, j)$ represents a row index while $(i', j')$ represents a column index. The choice model with support $K$ is identified with its $K$ signature components $(i_k, j_k)$, $1 \le k \le K$. The corresponding $Z$ has all $N^2 - K$ columns corresponding to indices other than these $K$ tuples equal to 0. The columns corresponding to the $(i_k, j_k)$ indices, $1 \le k \le K$, are non-zero with each representing a permutation consistent with the signature condition: for each $(i_k, j_k)$, $1 \le k \le K$,

$$Z_{(i,j)(i_k,j_k)} \in \{0, 1\}, \quad \text{for all } 1 \le i, j \le N, \tag{4.23}$$

$$Z_{(i_k,j_k)(i_k,j_k)} = 1, \tag{4.24}$$

$$Z_{(i,j)(i_k,j_k)} = 0, \quad \text{if } (i, j) \in \{(i_{k'}, j_{k'}) : 1 \le k' \le K,\ k' \ne k\}, \tag{4.25}$$

$$\sum_{\ell=1}^{N} Z_{(i,\ell)(i_k,j_k)} = 1, \quad \sum_{\ell=1}^{N} Z_{(\ell,j)(i_k,j_k)} = 1, \quad \text{for all } 1 \le i, j \le N. \tag{4.26}$$

Observe that (4.24)-(4.25) enforce the signature condition while (4.26) enforces the permutation structure. In summary, given a set of $K$ distinct pairs of indices, $(i_k, j_k)$, $1 \le k \le K$ with $1 \le i_k, j_k \le N$, (4.23)-(4.26) represent the set of all possible signature family models with these indices as their signature components.

Notice now that given the above representation, the problem of finding a choice model of support $K$ within the signature family that is within an $\varepsilon$-ball of the observed first-order marginal data, $D$, may be summarized as finding a $Z$ satisfying (4.23)-(4.26) and, in addition, satisfying $\|D - ZD\|_2 \le \varepsilon$. The remainder of this section will be devoted to solving this problem tractably.



Efficient representation of signature family. A signature family choice model with support $K$ can, in principle, have any $K$ of the $N^2$ possible tuples as its signature components. Therefore, one way to search the signature family for choice models is to first pick a set of $K$ tuples (there are $\binom{N^2}{K}$ such sets) and then, for that particular set of $K$ tuples, search among all $Z$s satisfying (4.23)-(4.26). It will be the complexity of this procedure that essentially drives the complexity of our approach. To this end we begin with the following observation: the problem of optimizing a linear functional of $Z$ subject to the constraints (4.23)-(4.26) is equivalent to optimizing the functional over the constraints

$$Z_{(i,j)(i_k,j_k)} \in [0, 1], \quad \text{for all } 1 \le i, j \le N, \tag{4.27}$$

$$Z_{(i_k,j_k)(i_k,j_k)} = 1, \tag{4.28}$$

$$Z_{(i,j)(i_k,j_k)} = 0, \quad \text{if } (i, j) \in \{(i_{k'}, j_{k'}) : 1 \le k' \le K,\ k' \ne k\}, \tag{4.29}$$

$$\sum_{\ell=1}^{N} Z_{(i,\ell)(i_k,j_k)} = 1, \quad \sum_{\ell=1}^{N} Z_{(\ell,j)(i_k,j_k)} = 1, \quad \text{for all } 1 \le i, j \le N. \tag{4.30}$$

It is easy to see that the points described by the set of equations (4.23)-(4.26) are contained in the polytope above described by equations (4.27)-(4.30). Thus, in order to justify our observation, it suffices to show that the polytope above is the convex hull of points satisfying (4.23)-(4.26). But this again follows from the Birkhoff-von Neumann theorem.



Searching the signature family. We now describe the main algorithm that will establish the result of Theorem 3.2. The algorithm succeeds in finding a choice model $\hat{\lambda}$ with sparsity $\|\hat{\lambda}\|_0 = O(\varepsilon^{-2} K \log N)$ and error $\|M(\hat{\lambda}) - D\|_\infty \le 2\varepsilon$ if there exists a choice model $\mu$ in the signature family with sparsity $K$ that is near consistent with $D$ in the sense that $\|M(\mu) - D\|_\infty \le \varepsilon$ (note that $\|\cdot\|_\infty \le \|\cdot\|_2$, so an $\varepsilon$-fit in the $\ell_2$ norm is also an $\varepsilon$-fit in the $\ell_\infty$ norm). The computation cost scales as $\exp(\Theta(K \log N))$. Our algorithm uses the so-called Multiplicative Weights algorithm utilized within the framework developed by Plotkin, Shmoys and Tardos [2] for fractional packing (also see [T]).

The algorithm starts by going over all possible $\binom{N^2}{K}$ subsets of possible signature components in any order till the desired choice model $\hat{\lambda}$ is found or all are exhausted. In the latter case, we declare the infeasibility of finding a $K$-sparse choice model in the signature family that is near consistent. Now consider any such set of $K$ signature components, $(i_k, j_k)$ with $1 \le k \le K$. By the definition of the signature family, the values $D_{(i_k, j_k)}$ for $1 \le k \le K$ are the probabilities of the $K$ permutations in the support. Therefore, we check if $1 - \varepsilon \le \sum_{k=1}^{K} D_{(i_k, j_k)} \le 1 + \varepsilon$. If not, we reject this set of $K$ tuples as signature components and move to the next set. If yes, we continue towards finding a choice model with these $K$ as signature components and the corresponding probabilities.
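The outer enumeration and the mass check can be sketched as follows. This is only the filtering step described above, with illustrative names; it does not attempt the subsequent search for a consistent $Z$.

```python
from itertools import combinations

def candidate_signature_sets(D, K, eps):
    """Yield every set of K index pairs whose D-entries sum to within eps of 1.

    D is an N x N doubly stochastic matrix given as a list of rows.
    """
    n = len(D)
    cells = [(i, j) for i in range(n) for j in range(n)]
    for subset in combinations(cells, K):  # C(N^2, K) subsets in total
        mass = sum(D[i][j] for i, j in subset)
        if 1 - eps <= mass <= 1 + eps:
            yield subset

if __name__ == "__main__":
    D = [[0.6, 0.4],
         [0.4, 0.6]]
    for subset in candidate_signature_sets(D, K=2, eps=0.05):
        print(subset)
```

In this 2 x 2 example, four of the six possible pairs of cells pass the check; the pairs with total mass 1.2 and 0.8 are rejected.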



The choice model of interest to us, and represented by a $Z$ satisfying (4.23)-(4.26), should be such that $D \approx ZD$. Put another way, we are interested in finding a $Z$ such that

$$D_{(i,j)} - \varepsilon \le \sum_{k=1}^{K} Z_{(i,j)(i_k,j_k)}\, D_{(i_k,j_k)} \le D_{(i,j)} + \varepsilon, \quad \text{for all } 1 \le i, j \le N, \tag{4.31}$$

$$Z \text{ satisfies (4.23)-(4.26)}. \tag{4.32}$$



This is precisely the setting considered by Plotkin-Shmoys-Tardos [44]: $Z$ is required to satisfy a certain collection of 'difficult' linear inequalities (4.31) and a certain other collection of 'easy' convex constraints (4.32) (easy, since these constraints can be replaced by (4.27)-(4.30), which provide a relaxation with no integrality gap as discussed earlier). If there is a feasible solution satisfying (4.31)-(4.32), then [44] finds a $Z$ that satisfies (4.31) approximately and (4.32) exactly. Otherwise, the procedure provides a certificate of the infeasibility of the above program; i.e. a certificate showing that no signature choice model approximately consistent with the data and with the $K$ signature components in question exists. We describe the precise algorithm next.

For ease of notation, we denote the choice model matrix $Z$ of dimension $N^2 \times N^2$ (effectively $N^2 \times K$) by a vector $z$ of $KN^2$ dimensions; we think of (4.31) as $2N^2$ inequalities denoted by $Az \ge b$ with $A$ being a $2N^2 \times KN^2$ matrix and $b$ being a $2N^2$ dimensional vector. Finally, the set of $z$ satisfying (4.32) is denoted $\mathcal{P}$. Thus, we are interested in finding $z \in \mathcal{P}$ such that $Az \ge b$.

The framework in [44] essentially tries to solve the Lagrangian relaxation of $Az \ge b$ over $z \in \mathcal{P}$ in an iterative manner. To that end, let $p_\ell$ be the Lagrangian variable (or weight) parameter associated with the $\ell$th constraint $a_\ell^T z \ge b_\ell$ for $1 \le \ell \le 2N^2$ (where $a_\ell$ is the $\ell$th row of $A$). We update the weights iteratively: let $t \in \{0, 1, \dots\}$ represent the index of the iteration. Initially, $t = 0$ and $p_\ell(0) = 1$ for all $\ell$. Given $p(t) = [p_\ell(t)]$, we find $z^t$ by solving the linear program

$$\text{maximize } \sum_{\ell} p_\ell(t)\,\big(a_\ell^T z - b_\ell\big) \quad \text{over } z \in \mathrm{co}(\mathcal{P}). \tag{4.33}$$



Notice that by our earlier discussion, $\mathrm{co}(\mathcal{P})$ is the polyhedron defined by the linear inequalities (4.27)-(4.30), so that optimal basic solutions to the LP above are optimal solutions to the optimization problem obtained if one replaced $\mathrm{co}(\mathcal{P})$ with simply $\mathcal{P}$. Now in the event that the above LP is infeasible, or else if its optimal value is negative, we declare immediately that there does not exist a $K$-sparse choice model with the $K$ signature components in question that is approximately consistent with the observed data; this is because the above program is a relaxation of (4.31)-(4.32) in that (4.31) has been relaxed via the 'Lagrange' multiplier $p(t)$. Further, if the original program were feasible, then our LP should have a solution of non-negative value since the weights $p(t)$ are non-negative. Assuming we do not declare infeasibility, the solution $z^t$ obtained is a $K$-sparse choice model whose signature components correspond to the $K$ components we began the procedure with.

Assuming that the linear program is feasible, and given an optimal basic feasible solution $z^t$, the weights $p(t+1)$ are obtained as follows: for $\delta = \min(\varepsilon/8, 1/2)$, we set:

$$p_\ell(t+1) = p_\ell(t)\,\big(1 - \delta\,(a_\ell^T z^t - b_\ell)\big). \tag{4.34}$$

The above update (4.34) suggests that if the $\ell$th inequality is not satisfied, we should increase the penalty imposed by $p_\ell(t)$ (in proportion to the degree of violation) or else, if it is satisfied, we decrease the penalty imposed by $p_\ell(t)$ in proportion to the 'slack' in the constraint. Now, $a_\ell^T z^t - b_\ell \in [-2, 2]$. To see this note that: First, $b_\ell \in [0, 1]$ since it corresponds to an entry in a non-negative doubly stochastic matrix $D$. Further, $a_\ell^T z^t \in [0, 1 + \varepsilon]$ since it corresponds to the summation of a subset of $K$ non-negative entries $D_{(i_k, j_k)}$, $1 \le k \le K$, and by choice we have made sure that the sum of these $K$ entries is at most $1 + \varepsilon$. Hence, the multiplicative update to each of the $p_\ell(\cdot)$ is by a factor of at most $(1 \pm 2\delta)$ in a single iteration. Such a bound on the relative change of these weights is necessary for the success of the algorithm.

Now, assume we have not declared infeasibility for all $t \le T$ and consider the sequence of solutions, $z^t$. Further, set $T = 64\,\varepsilon^{-2} \ln(2N^2) = \Theta(\varepsilon^{-2} \log N)$, and define $\hat{z} = \frac{1}{T}\sum_{t=0}^{T-1} z^t$. Then, we have via Corollary 4 in pQ (see also, [H Section 3.2]), that

$$a_\ell^T \hat{z} \ge b_\ell - \varepsilon, \quad \text{for all } 1 \le \ell \le 2N^2. \tag{4.35}$$

Now $\hat{z}$ corresponds to a choice model (call it $\hat{\lambda}$) with support over at most $O(KT) = O(\varepsilon^{-2} K \log N)$ permutations since each $z^t$ is a choice model with support over $K$ permutations in the signature family. Further, (4.35) implies that $\|M(\hat{\lambda}) - D\|_\infty \le 2\varepsilon$.
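The weight update (4.34) and the iteration count can be sketched in a few lines. The LP step (4.33) is deliberately left out; `slacks` stands for the values $a_\ell^T z^t - b_\ell$ returned by whatever LP solver is used, and the function names are illustrative only.

```python
import math

def iteration_count(n, eps):
    """T = 64 * eps^-2 * ln(2 N^2), rounded up to an integer."""
    return math.ceil(64.0 / eps ** 2 * math.log(2 * n * n))

def update_weights(p, slacks, eps):
    """One multiplicative-weights step: p_l <- p_l * (1 - delta * slack_l),
    with delta = min(eps / 8, 1/2) and slack_l = a_l^T z^t - b_l in [-2, 2]."""
    delta = min(eps / 8.0, 0.5)
    return [p_l * (1.0 - delta * s) for p_l, s in zip(p, slacks)]

if __name__ == "__main__":
    p = [1.0, 1.0, 1.0]
    # violated constraint (negative slack) gains weight, satisfied one loses weight
    print(update_weights(p, [-2.0, 0.0, 2.0], eps=0.4))
    print(iteration_count(5, 0.5))
```

Because the slacks lie in $[-2, 2]$ and $\delta \le 1/2$, each factor $1 - \delta \cdot \text{slack}$ is non-negative, so the weights never change sign.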

Finally, note that the computational complexity of the above described algorithm, for a given subset of $K$ signature components, is polynomial in $N$. Therefore, the overall computational cost of the above described algorithm is dominated by the term $\binom{N^2}{K}$, which is at most $N^{2K}$. That is, for any $K \ge 1$, the overall computation cost of the algorithm is bounded above by $\exp(\Theta(K \log N))$. This completes the proof of Theorem 3.2.



Utilizing the algorithm. It is not clear a priori if, for a given set of first-order marginal information, $D$, there exists a signature family model of sparsity $K$ within some small error $\varepsilon > 0$ with $\varepsilon \le \varepsilon_0$, where $\varepsilon_0$ is the maximum error we can tolerate. The natural way to adapt the above algorithm is as follows. Search over increasing values of $K$ and for each $K$ search for $\varepsilon = \varepsilon_0$. For the first $K$ for which the algorithm succeeds, it may be worth optimizing over the error allowed, $\varepsilon$, by means of a binary search: $\varepsilon_0/2, \varepsilon_0/4, \dots$. Clearly such a procedure would require $O(\log 1/\varepsilon)$ additional runs of the same algorithm for the given $K$, where $\varepsilon$ is the best precision we can obtain.
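The search procedure just described can be sketched as a small driver around the fitting algorithm. Here `fit(K, eps)` is a hypothetical oracle standing in for the algorithm of Section 4.3: it is assumed to return a model when one is found and `None` otherwise. The cap on the number of halvings is also an assumption of this sketch.

```python
def search_sparse_model(fit, eps0, max_k, max_halvings=20):
    """Find the smallest K that fits at tolerance eps0, then halve eps
    (eps0/2, eps0/4, ...) for that K until the fit fails."""
    for k in range(1, max_k + 1):
        model = fit(k, eps0)
        if model is None:
            continue  # no K-sparse signature model at tolerance eps0
        eps = eps0
        for _ in range(max_halvings):
            tighter = fit(k, eps / 2)
            if tighter is None:
                break
            model, eps = tighter, eps / 2
        return k, eps, model
    return None

if __name__ == "__main__":
    # Toy oracle: pretends a fit exists iff K >= 3 and eps >= 0.1.
    toy_fit = lambda k, eps: ("model", k, eps) if k >= 3 and eps >= 0.1 else None
    print(search_sparse_model(toy_fit, eps0=1.0, max_k=10))
```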



5. An empirical study 
This Section is devoted to answering the following, inherently empirical, question: 

Can sparse choice models fit to limited information about the underlying 'true' 
choice model be used to effectively uncover information one would otherwise un- 
cover with ostensibly richer data? 

In this section, we describe an empirical study we conducted that supports an affirmative answer to the above question. For the purpose of the study, we used the well-known APA (American Psychological Association) dataset that was first used by [19] in order to demonstrate the underlying structure one can unearth by studying the appropriate lower-dimensional 'projections' of choice models, which include first and second order marginals.

Specifically, the dataset comprises the ballots collected for electing the president 
of the APA. Each member expresses her/his preferences by rank ordering the can- 
didates contesting the election. In the year under consideration, there were five 
candidates contesting the election and a total of 5,738 votes that were complete 
rankings. This information yields a distribution mapping each permutation to the 
fraction of voters who vote for it. Given all the votes, the winning candidate is 
determined using the Hare system (see [33] for details about the Hare system). 

A common issue in such election systems is that it is a difficult cognitive task for 
voters to rank order all the candidates even if the number of candidates is only five. 
This, for example, is evidenced by the fact that out of more than 15,000 ballots cast 
in the APA election, only 5,738 of them are complete. The problem only worsens 
as the number of candidates to rank increases. One way to overcome this issue is to 
design an election system that collects only partial information from members. The 
partial information still retains some of the structure of the underlying distribution, 
and the loss of information is the price one pays for the simplicity of the election 
process. For example, one can gather first-order partial information i.e., the fraction 
of people who rank candidate i to position r. As discussed by [B5], the first-order 
marginals retain useful underlying structure like: (1) candidate 3 has a lot of "love" 
(28% of the first-position vote) and "hate" (23% of the last-position vote) vote; (2) 
candidate 1 is strong in second position (26% of the vote) and low hate vote (15% 
of last-position vote); (3) voters seem indifferent about candidate 5. 

Having collected only first order information, our goal will be to answer natural 
questions such as: who should win the election? or what is the 'socially preferred' 
ranking of candidates? Of course, there isn't a definitive manner in which the 
above questions might be answered. However, having a complete distribution over 
permutations affords us the flexibility of using any of the several rank aggregation 
systems available. In order to retain this flexibility, we will fit a sparse distribution 
to the partial information and then use this sparse distribution as input to the 
rank aggregation system of choice to determine the 'winning' ranking. Such an 
approach would be of value if the sparse distribution can capture the underlying 
structural information of the problem at hand. Therefore, with an aim to under- 
standing the type of structure sparse models can capture, we first considered the 
first-order marginal information of the dataset (or distribution). We let $\lambda$ denote the underlying "true" distribution corresponding to the 5,738 complete rankings of the 5 candidates. The 5x5 first-order marginal matrix $D$ is given in Table 1.

Candidate   Rank 1   Rank 2   Rank 3   Rank 4   Rank 5
    1         18       26       23       17       15
    2         14       19       25       24       18
    3         28       17       14       18       23
    4         20       17       19       20       23
    5         20       21       20       19       20

Table 1. The first-order marginal matrix where the entry corresponding to candidate i and rank j is the percentage of voters who rank candidate i to position j.

For this $D$, we ran a heuristic version of the algorithm described in Section 4.3. Roughly speaking, the heuristic tries to find in a greedy manner a sparse choice model in the signature family that approximates the observed data. It runs very fast (polynomial in $N$) and seems to provide approximations of a quality guaranteed by the algorithm in Section 4.3. However, we are unable to prove any guarantees for it. To keep the exposition simple, and to avoid distraction, we do not describe the heuristic here but simply refer the interested reader to [28]. Using the heuristic, we obtained the following sparse model $\hat{\lambda}$:

24153 0.211990 
32541 0.202406 
15432 0.197331 
43215 0.180417 
51324 0.145649 
23154 0.062206 

In the description of the model $\hat{\lambda}$ above, we have adopted the notation
used in [15] to represent each rank-list by a five-digit number in which each
candidate is shown in the position it is ranked, i.e., 24153 represents the
rank-list in which candidate 2 is ranked at position 1, candidate 4 is ranked at
position 2, candidate 1 is ranked at position 3, candidate 5 is ranked at position
4, and candidate 3 is ranked at position 5. Note that the support size of
$\hat{\lambda}$ is only 6, which is a significant reduction from the full support
size of 5! = 120 of the underlying distribution. The average relative error in the
approximation of D by the first-order marginals $M(\hat{\lambda})$ is less than
0.075, where the average relative error is defined as

$$\frac{1}{25} \sum_{1 \le i, j \le 5} \frac{|M(\hat{\lambda})_{ij} - D_{ij}|}{D_{ij}}.$$

Note that this measure of error, being relative, is more stringent than measuring 
additive error. The main conclusion we can draw from the small relative error we 
obtained is that the heuristic we used can successfully find sparse models that are 
a good fit to the data in interesting practical cases. 
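The reported error can be checked directly from the six-permutation model and Table 1. The Python sketch below recomputes the first-order marginals and the average relative error; all function and variable names are illustrative and do not come from the original work. One assumption should be flagged: the sketch treats the i-th digit of each string as the position assigned to candidate i, because that is the indexing under which the computed error agrees with the reported bound. Reading the digits the other way around (the digit in place r names the candidate at position r) yields the transposed marginal matrix.

```python
# Sparse model as listed in the text: digit string -> probability.
SPARSE_MODEL = {
    "24153": 0.211990,
    "32541": 0.202406,
    "15432": 0.197331,
    "43215": 0.180417,
    "51324": 0.145649,
    "23154": 0.062206,
}

# Table 1: D[i][j] = percentage of voters ranking candidate i+1 at position j+1.
D = [
    [18, 26, 23, 17, 15],
    [14, 19, 25, 24, 18],
    [28, 17, 14, 18, 23],
    [20, 17, 19, 20, 23],
    [20, 21, 20, 19, 20],
]


def first_order_marginals(model, n=5):
    """Return M where M[i][j] is the percentage mass placing candidate i+1 at position j+1.

    Assumption of this sketch: the i-th digit of each string is the
    position assigned to candidate i.
    """
    marginals = [[0.0] * n for _ in range(n)]
    for digits, prob in model.items():
        for candidate, position in enumerate(digits):
            marginals[candidate][int(position) - 1] += 100.0 * prob
    return marginals


def average_relative_error(marginals, observed):
    """Mean over all entries of |M_ij - D_ij| / D_ij."""
    n = len(observed)
    total = sum(
        abs(marginals[i][j] - observed[i][j]) / observed[i][j]
        for i in range(n)
        for j in range(n)
    )
    return total / (n * n)


M_HAT = first_order_marginals(SPARSE_MODEL)
ERR = average_relative_error(M_HAT, D)
print(f"average relative error: {ERR:.4f}")
```

Under the stated indexing assumption, the value printed is below 0.075, in line with the figure quoted above.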

5.1. Structural Conclusions. Now that we have managed to obtain a huge re- 
duction in sparsity at the cost of an average relative error of 0.075 in approximating 
first-order marginals, we next try to understand the type of structure the sparse 
model is able to capture from just the first-order marginals. More importantly, we 
will attempt to compare these conclusions with conclusions drawn from what is 
ostensibly 'richer' data: 



Comparing CDFs: We begin with comparing the 'stair-case' curves of the
cumulative distribution functions (CDF) of the actual distribution $\lambda$ and the
sparse approximation $\hat{\lambda}$ in Figure 5.1. Along the x-axis in the plot, the
permutations are ordered such that nearby permutations are "close" to each other in
the sense that only a few transpositions (pairwise swaps) are needed to go from one
permutation to another. The figure visually represents how well the sparse model
approximates the true CDF.
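The notion of closeness used to order the x-axis can be made concrete. Assuming that "transpositions" means arbitrary pairwise swaps (not only swaps of adjacent entries), the minimum number of swaps needed to turn one permutation into another equals the length of the permutation minus the number of cycles of the permutation relating the two. The sketch below computes this quantity; the function name is hypothetical, and the text does not say how the plotted ordering itself was constructed.

```python
def transposition_distance(a, b):
    """Minimum number of pairwise swaps that turn sequence a into sequence b.

    Both arguments must be orderings of the same distinct symbols.
    """
    position_in_b = {symbol: index for index, symbol in enumerate(b)}
    # mapping[k] is the index in b of the symbol that sits at index k in a.
    mapping = [position_in_b[symbol] for symbol in a]

    seen = [False] * len(mapping)
    cycles = 0
    for start in range(len(mapping)):
        if seen[start]:
            continue
        cycles += 1
        k = start
        while not seen[k]:
            seen[k] = True
            k = mapping[k]
    # A cycle of length c needs c - 1 swaps, so the total is n - (number of cycles).
    return len(mapping) - cycles


print(transposition_distance("12345", "21345"))  # one swap
print(transposition_distance("12345", "23145"))  # a 3-cycle needs two swaps
```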

Now, one is frequently interested in a functional of the underlying choice model 
such as determining a winner, or perhaps, determining a socially preferred ranking. 
We next compare conclusions drawn from applying certain functionals to the sparse
choice model we have learned with conclusions drawn from applying the same
functional to what is ostensibly richer data:

Winner Determination: Consider a functional of the distribution over rank- 
ings meant to capture the most 'socially preferred' ranking. There are many such 
functionals, and the Hare system provides one such example. When applied to the
sparse choice model we have learned, this yields the permutation 13245. Now, one
may use the Hare system to determine a winner with all of the voting data. This
data is substantially richer than the first-order marginal information used by our
approach. In particular, it consists of 5,738 votes that are complete permutations
of the candidates (from which our first-order marginal information was derived),
and approximately 10,000 additional votes for partial rankings of the same
candidates. Applying the Hare system here also yields 1 as the winning candidate,
as reported by [T§].
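The Hare computation on the sparse model can be sketched in a few lines of Python. The sketch rests on several assumptions that the text does not spell out: each string is read with the convention described earlier (the digit in place r names the candidate ranked at position r), the weight of each ranking goes to its highest-ranked surviving candidate, the candidate with the smallest share is eliminated in each round, and the aggregate ranking is taken to be the reverse of the elimination order. The name `hare_ranking` is hypothetical.

```python
# Sparse model as listed in the text: digit string -> probability.
SPARSE_MODEL = {
    "24153": 0.211990,
    "32541": 0.202406,
    "15432": 0.197331,
    "43215": 0.180417,
    "51324": 0.145649,
    "23154": 0.062206,
}


def hare_ranking(model):
    """Return candidates from most to least preferred under Hare elimination.

    Assumption of this sketch: the ranking is the reverse elimination order.
    """
    remaining = set(next(iter(model)))
    eliminated = []
    while len(remaining) > 1:
        # Each ranking's weight counts for its top surviving candidate.
        tally = {candidate: 0.0 for candidate in remaining}
        for ranking, prob in model.items():
            top = next(c for c in ranking if c in remaining)
            tally[top] += prob
        # Eliminate the smallest share; ties go to the smallest label.
        loser = min(sorted(remaining), key=lambda c: tally[c])
        remaining.remove(loser)
        eliminated.append(loser)
    eliminated.append(remaining.pop())
    return "".join(reversed(eliminated))


print(hare_ranking(SPARSE_MODEL))  # prints 13245
```

Under these assumptions the elimination order is 5, 4, 2, 3, which leaves candidate 1 as the winner and gives the ranking 13245.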

Rank Aggregation: In addition to determining a winner, the Hare system
applied to a choice model also yields an 'aggregate' permutation which, one may
argue, represents the aggregate opinions of the population in a 'fair' way. Now,
as reported above, the Hare system applied to our sparse choice model yields the
permutation 13245. As it turns out, this permutation is in remarkable agreement
with conclusions drawn by Diaconis using higher-order partial information derived
from the same set of 5,738 votes used here. In particular, using second-order
marginal data, i.e., information on the fraction of voters that ranked candidates
{i, j} to positions {k, l} (without accounting for order in the latter set) for all
distinct i, j, k, l, yields the following conclusion, paraphrased from [19]: There
is a strong effect for candidates {1,3} to be ranked first and second and for
candidates {4,5} to be ranked fourth and fifth, with candidate 2 in the middle.
Diaconis goes on to provide some color to this conclusion by explaining that voting
is typically along partisan lines (academicians vs. clinicians) and as such these
groups tend to fall behind the candidate groups {1,3} and {4,5}. Simultaneously,
these candidate groups also receive a 'hate' vote wherein they are voted as the
least preferred by the voters in the opposing camp. Candidate 2 is apparently
something of a compromise candidate. Remarkably, we have arrived at the very same
permutation using first-order data.

Sparse Support Size: It is somewhat tantalizing to relate the support size (6) of
the sparse choice model learned with the structure observed in the dataset by
Diaconis [19] discussed in our last point: there are effectively three types
(groups) of candidates, viz. {2}, {1, 3} and {4, 5}, in the eyes of the partisan
voters. Therefore, all votes effectively exhibit an ordering/preference over these
three groups primarily, and so the votes effectively represent 3! = 6 distinct
preferences. This is precisely the size of the support of our sparse approximation;
of course, this explanation is not perfect since the permutations in the choice
model learned split up these groups.

Figure 5.1. Comparison of the CDFs of the true distribution and
the sparse approximation we obtain for the APA dataset. The x-axis
represents the 5! = 120 different permutations ordered so that
nearby permutations are close to each other with respect to the
pairwise transposition distance.



6. Discussion

6.1. Summary. Choice models are an integral part of various important
decision-making tasks. Sparse approximations to the underlying true choice model
based on (information-) limited observed data are particularly attractive as they
are simple and hence easy to integrate into complex decision-making tasks. In
addition, learning sparse models from marginal data provides a nonparametric
approach to choice modeling. This paper, in a sense, has taken important steps
towards establishing sparse choice model approximation as a viable option.

As the first main result, we showed that for first-order information, if we are
willing to allow for an error of $\varepsilon$ in approximating a given doubly
stochastic matrix (that is, noisy observations of first-order marginal
information), then there exists a choice model with sparsity $O(N/\varepsilon)$
that is an $\varepsilon$-fit to the data. Note that this is a significant reduction
from the $\Theta(N^2)$ that is guaranteed by the Birkhoff-von Neumann and
Caratheodory theorems. Given that we can expect to find sparse models, we
considered the issue of efficient recovery of sparse models. We showed that as long
as there is a choice model $\lambda$ of sparsity K in the signature family that is
an $\varepsilon$-fit to the data, we can find a choice model of sparsity
$O(K \log N)$ that is a $2\varepsilon$-fit to the data in time
$O(\exp(\Theta(K \log N)))$, as opposed to the brute-force time complexity of
$O(\exp(\Theta(K N \log N)))$. The computational efficiency is achieved by means of
the 'signature condition'. In prior work, this condition was shown to be useful in
learning an already sparse choice model from its noise-free first-order marginals.
This work establishes, in a sense, the robustness of these conditions. Finally, we
demonstrated the ubiquity of the signature family by showing that it is
appropriately "dense" for a large class of models.
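The baseline given by the Birkhoff-von Neumann and Caratheodory theorems can be illustrated with a small decomposition routine. The following Python sketch is not the algorithm of this paper: it is the standard greedy Birkhoff decomposition, implemented here by brute-force search over permutations so that it stays self-contained, which restricts it to very small N. The matrix `EXAMPLE` is a made-up doubly stochastic matrix used only for illustration. The greedy procedure writes an N x N doubly stochastic matrix as a convex combination of at most (N-1)^2 + 1 permutation matrices.

```python
from itertools import permutations


def birkhoff_decomposition(matrix, tol=1e-9):
    """Greedy Birkhoff-von Neumann decomposition by brute force (small N only).

    Returns a list of (weight, sigma) pairs, where sigma[i] is the column
    matched to row i, such that the weighted permutation matrices sum to
    the input matrix up to the tolerance.
    """
    n = len(matrix)
    residual = [row[:] for row in matrix]
    terms = []
    while True:
        # Look for a permutation supported on strictly positive residual entries.
        sigma = next(
            (p for p in permutations(range(n))
             if all(residual[i][p[i]] > tol for i in range(n))),
            None,
        )
        if sigma is None:
            break
        # Subtracting the smallest matched entry zeroes at least one entry.
        weight = min(residual[i][sigma[i]] for i in range(n))
        for i in range(n):
            residual[i][sigma[i]] -= weight
        terms.append((weight, sigma))
    return terms


# Hypothetical 3 x 3 doubly stochastic matrix, for illustration only.
EXAMPLE = [
    [0.5, 0.5, 0.0],
    [0.5, 0.0, 0.5],
    [0.0, 0.5, 0.5],
]
TERMS = birkhoff_decomposition(EXAMPLE)
for weight, sigma in TERMS:
    print(weight, sigma)
```

The number of terms produced this way can grow quadratically in N, which is the baseline that the $O(N/\varepsilon)$ sparsity result improves upon when an approximation error of $\varepsilon$ is acceptable.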

In the recently popular compressive sensing literature, the restricted null space 
condition has been shown to be necessary and sufficient for efficient learning of 
sparse models via linear programs. It was shown in the past that this restricted 
null space condition (or effectively a linear programming relaxation of our problem) 
is ineffective in learning sparse choice models. In that sense, this work shows that 
'signature conditions' are another set of sufficient conditions that help learn sparse 
choice models in a computationally efficient fashion. 

6.2. Beyond first-order marginals. Here we discuss the applicability of the
results of this work beyond first-order marginal information. The proof of the
result (Theorem 3.1) on "how sparse the sparse models are" does not exploit the
structure of first-order marginals and hence can be extended in a reasonably
straightforward manner to other types of marginal information. Similarly, we
strongly believe the result (Theorem 3.3) that we can find good approximations in
the signature family for a large class of choice models extends to other types of
marginal information (see [30] for the basis of our belief). However, the result
about computational efficiency (Theorem 3.2) strongly relies on the efficient
description of the first-order marginal polytope by means of the Birkhoff-von
Neumann theorem and will not readily extend to other types of marginal information.
The algorithm presented in Section 4.3 extends to higher-order marginals with
possibly a computationally complex oracle to check the feasibility of a signature
choice model with respect to the higher-order marginal. Indeed, it would be an
important direction for future research to overcome this computational threshold by
possibly developing better computational approximations. The heuristic utilized in
Section 5 is quite efficient (polynomial in N) for first-order marginals. It is
primarily inspired by the exact recovery algorithm based on the signature condition
utilized in our earlier work. We strongly believe that such a heuristic is likely
to provide a computationally efficient procedure for higher-order marginal data.

6.3. Signature condition and computational efficiency. As discussed earlier,
the signature condition affords us a computationally efficient procedure for the
recovery of a sparse choice model approximately consistent with first-order
marginal information for a broad family of choice models. This is collectively
established by Theorems 3.2 and 3.3. Specifically, the computational speedup
relative to brute-force search is significant. However, it is worth asking whether
alternative algorithms that do not rely on the signature condition can provide a
speedup relative to brute-force search. As it turns out, the following can be
shown: Assume there exists a sparse choice model, with sparsity K, that
approximates the observed first-order marginals (i.e., a doubly stochastic matrix)
within accuracy $\varepsilon$. In this case, we can recover a sparse choice model
(not necessarily in the signature family) with sparsity
$O(\varepsilon^{-2} K \log N)$ that approximates the observed first-order marginals
within an error of $O(\varepsilon)$. We can recover this model in time
$(1/\varepsilon)^K \times \exp(\Theta(K \log N))$. Given Theorem 3.1 and the
following discussion, it is reasonable to think of $K = \Theta(N)$.

Therefore, this computational cost is effectively worse by a factor of
$(1/\varepsilon)^K$ compared to that of our approach using the signature condition
in Section 4.3. This inferiority notwithstanding, it is worth describing the simple
heuristic. Before doing so, it is important to note however that this alternate
heuristic has no natural generalization to observed data outside of the realm of
first-order marginal information. In contrast, the approach we have followed, by
relying on the structure afforded by the signature family, suggests a fast
(polynomial in N) heuristic that can potentially be applied for many distinct types
of marginal data; this is the very heuristic employed in Section 5. Establishing
theoretical guarantees for this heuristic remains an important direction for future
research.

Now we provide a brief description of the algorithm hinted at above. The
algorithm is similar to that described in Section 4.3. Specifically, it tries to
find K permutations and their associated probabilities, so that the resulting
distribution has first-order marginals that are near consistent with the
observations. Now the K unknown permutations are represented through their linear
relaxations implied by the Birkhoff-von Neumann result, i.e., (4.27) and (4.30).
Under the signature condition, the associated probabilities were discovered
implicitly by means of (4.28) and (4.29). However, without the signature condition,
the only option we have is to search through all possible values for these
probabilities. Since we are interested in an approximation accuracy of
$\varepsilon$, it suffices to check $O((K/\varepsilon)^K)$ such probability
vectors. For a given such probability vector, we are left with the problem of
searching for K permutations with these probabilities that have their corresponding
first-order marginals well approximated by the observed data. This again fits into
the framework of Plotkin, Shmoys and Tardos as discussed in Section 4.3. Therefore,
using ideas similar to those described there, we can find a sparse choice model
with sparsity $O(\varepsilon^{-2} K \log N)$ efficiently (within time polynomial in
N) if there exists a sparse model with sparsity K and the particular quantized
probability vector that approximates the observations sufficiently well. Since the
algorithm searches over increasing values of K and, for a given K, over all
$O((K/\varepsilon)^K)$ distinct probability vectors, the effective computation cost
will be dominated by the largest K value encountered by the algorithm. This
effectively completes the explanation of the algorithm and its computational cost.
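The search over probability vectors described above can be pictured with a simple grid enumeration. The sketch below is one possible quantization, not necessarily the one the authors have in mind: it assumes a uniform grid in which every probability is a multiple of 1/m, with m chosen on the order of $K/\varepsilon$ so that rounding each of the K entries perturbs the vector by at most about $\varepsilon$ in total. The function name and the parameter m are hypothetical. The number of grid points is the binomial coefficient C(m+K-1, K-1), which is at most (m+1)^(K-1) and therefore within the $O((K/\varepsilon)^K)$ count quoted in the text.

```python
from itertools import combinations
from math import comb


def quantized_probability_vectors(K, m):
    """Yield every length-K vector with entries in {0, 1/m, ..., 1} that sums to 1.

    Uses the stars-and-bars correspondence: choosing K-1 bar positions among
    m+K-1 slots splits the m remaining "stars" into K non-negative counts.
    """
    slots = m + K - 1
    for bars in combinations(range(slots), K - 1):
        counts = []
        previous = -1
        for bar in bars:
            counts.append(bar - previous - 1)
            previous = bar
        counts.append(slots - previous - 1)
        yield tuple(count / m for count in counts)


# Small illustration: K = 3 probabilities on a grid of step 1/4.
VECTORS = list(quantized_probability_vectors(3, 4))
print(len(VECTORS), comb(4 + 3 - 1, 3 - 1))  # both counts agree
```

For each vector on the grid, the remaining task is the search for K permutations described in the paragraph above, which is why the grid size multiplies the overall running time.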



References

[1] S. Arora, E. Hazan, and S. Kale. The multiplicative weights update method: a meta
    algorithm and applications. Manuscript, 2005.
[2] K. Bartels, Y. Boztug, and M. M. Muller. Testing the multinomial logit model. Working
    Paper, 1999.
[3] M. E. Ben-Akiva. Structure of passenger travel demand models. PhD thesis, Department of
    Civil Engineering, MIT, 1973.
[4] M. E. Ben-Akiva and S. R. Lerman. Discrete choice analysis: theory and application to
    travel demand. MIT Press, Cambridge, MA, 1985.
[5] R. Beran. Exponential models for directional data. The Annals of Statistics,
    7(6):1162-1178, 1979.
[6] R. Berinde, A. C. Gilbert, P. Indyk, H. Karloff, and M. J. Strauss. Combining geometry
    and combinatorics: A unified approach to sparse signal recovery, pages 798-805, Sep. 2008.
[7] G. Birkhoff. Tres observaciones sobre el algebra lineal. Univ. Nac. Tucuman Rev. Ser. A,
    5:147-151, 1946.
[8] J. H. Boyd and R. E. Mellman. The effect of fuel economy standards on the U.S. automotive
    market: An hedonic demand analysis. Transportation Research Part A: General,
    14(5-6):367-378, 1980.
[9] R. A. Bradley. Some statistical methods in taste testing and quality evaluation.
    Biometrics, 9:22-38, 1953.
[10] E. J. Candes and J. Romberg. Quantitative robust uncertainty principles and optimally
    sparse decompositions. Foundations of Computational Mathematics, 6(2):227-254, 2006.
[11] E. J. Candes, J. Romberg, and T. Tao. Robust uncertainty principles: Exact signal
    reconstruction from highly incomplete frequency information. IEEE Transactions on
    Information Theory, 52(2):489-509, 2006.
[12] E. J. Candes, J. K. Romberg, and T. Tao. Stable signal recovery from incomplete and
    inaccurate measurements. Communications on Pure and Applied Mathematics, 59(8), 2006.
[13] E. J. Candes and T. Tao. Decoding by linear programming. IEEE Transactions on
    Information Theory, 51(12):4203-4215, 2005.
[14] N. S. Cardell and F. C. Dunbar. Measuring the societal impacts of automobile downsizing.
    Transportation Research Part A: General, 14(5-6):423-434, 1980.
[15] G. Cormode and S. Muthukrishnan. Combinatorial algorithms for compressed sensing.
    Lecture Notes in Computer Science, 4056:280, 2006.
[16] B. Crain. Exponential models, maximum likelihood estimation, and the Haar condition.
    Journal of the American Statistical Association, 71:737-745, 1976.
[17] G. Debreu. Review of R. D. Luce, 'Individual choice behavior: A theoretical analysis'.
    American Economic Review, 50:186-188, 1960.
[18] P. Diaconis. Group representations in probability and statistics. Institute of
    Mathematical Statistics, Hayward, CA, 1988.
[19] P. Diaconis. A generalization of spectral analysis with application to ranked data. The
    Annals of Statistics, 17(3):949-979, 1989.
[20] D. L. Donoho. Compressed sensing. IEEE Transactions on Information Theory,
    52(4):1289-1306, 2006.
[21] V. Farias, S. Jagabathula, and D. Shah. A data-driven approach to modeling choice. In
    Neural Information Processing Systems, 2009.
[22] V. Farias, S. Jagabathula, and D. Shah. A non-parametric approach to modeling choice
    with limited data. Submitted to Management Science, 2009.
[23] P. C. Fishburn and S. J. Brams. Paradoxes of preferential voting. Mathematics Magazine,
    56(4):207-214, 1983.
[24] R. Gallager. Low-density parity-check codes. IRE Transactions on Information Theory,
    8(1):21-28, 1962.
[25] A. C. Gilbert, M. J. Strauss, J. A. Tropp, and R. Vershynin. One sketch for all: fast
    algorithms for compressed sensing. In STOC '07: Proceedings of the thirty-ninth annual
    ACM symposium on Theory of computing, pages 237-246, New York, NY, USA, 2007. ACM.
[26] P. M. Guadagni and J. D. C. Little. A logit model of brand choice calibrated on scanner
    data. Marketing Science, 2(3):203-238, 1983.
[27] J. L. Horowitz. Semiparametric estimation of a work-trip mode choice model. Journal of
    Econometrics, 58:49-70, 1993.
[28] S. Jagabathula. Nonparametric Choice Modeling: Applications to Operations Management.
    PhD thesis, Department of Electrical Engineering and Computer Science, MIT, 2011.
[29] S. Jagabathula and D. Shah. Inferring rankings under constrained sensing. In NIPS, 2008.
[30] S. Jagabathula and D. Shah. Inferring rankings under constrained sensing. IEEE
    Transactions on Information Theory, 2011.
[31] B. O. Koopman. On distributions admitting a sufficient statistic. Transactions of the
    American Mathematical Society, 39(3):399-409, 1936.
[32] M. G. Luby, M. Mitzenmacher, M. A. Shokrollahi, and D. A. Spielman. Improved low-density
    parity-check codes using irregular graphs. IEEE Transactions on Information Theory,
    47(2):585-598, 2001.
[33] R. D. Luce. Individual choice behavior: A theoretical analysis. Wiley, New York, 1959.
[34] S. Mahajan and G. J. van Ryzin. On the relationship between inventory costs and variety
    benefits in retail assortments. Management Science, 45(11):1496-1509, 1999.
[35] J. I. Marden. Analyzing and modeling rank data. Chapman & Hall/CRC, 1995.
[36] J. Marschak. Binary choice constraints on random utility indicators. Cowles Foundation
    Discussion Papers, 1959.
[37] J. Marschak and R. Radner. Economic Theory of Teams. Yale University Press, New Haven,
    CT, 1972.
[38] D. McFadden. Conditional logit analysis of qualitative choice behavior. Frontiers in
    Econometrics, P. Zarembka (ed.), pages 105-142, 1973.
[39] D. McFadden. Econometric models of probabilistic choice. In "Structural Analysis of
    Discrete Data with Econometric Applications" (C. F. Manski and D. McFadden, Eds.), 1981.
[40] D. McFadden. Disaggregate behavioral travel demands rum side. Travel Behaviour Research,
    pages 17-63, 2001.
[41] D. McFadden and K. Train. Mixed MNL models for discrete response. Journal of Applied
    Econometrics, 15(5):447-470, September 2000.
[42] H. Nyquist. Certain topics in telegraph transmission theory. Proceedings of the IEEE,
    90(2):280-305, 2002.
[43] R. L. Plackett. The analysis of permutations. Applied Statistics, 24(2):193-202, 1975.
[44] S. A. Plotkin, D. B. Shmoys, and E. Tardos. Fast approximation algorithms for fractional
    packing and covering problems. In IEEE FOCS, 1991.
[45] I. S. Reed and G. Solomon. Polynomial codes over certain finite fields. Journal of the
    Society for Industrial and Applied Mathematics, pages 300-304, 1960.
[46] C. E. Shannon. Communication in the presence of noise. Proceedings of the IRE,
    37(1):10-21, 1949.
[47] M. Sipser and D. A. Spielman. Expander codes. IEEE Transactions on Information Theory,
    42:1710-1722, 1996.
[48] L. Thurstone. A law of comparative judgement. Psychological Reviews, 34:237-286, 1927.
[49] J. A. Tropp. Greed is good: Algorithmic results for sparse approximation. IEEE
    Transactions on Information Theory, 50(10):2231-2242, 2004.
[50] J. A. Tropp. Just relax: Convex programming methods for identifying sparse signals in
    noise. IEEE Transactions on Information Theory, 52(3):1030-1051, 2006.
[51] J. von Neumann. A certain zero-sum two-person game equivalent to the optimal assignment
    problem. In Contributions to the theory of games, 2, 1953.
[52] M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and
    variational inference. Foundations and Trends in Machine Learning, 1(1-2):1-305, 2008.
[53] J. I. Yellott. The relationship between Luce's choice axiom, Thurstone's theory of
    comparative judgment, and the double exponential distribution. Journal of Mathematical
    Psychology, 15(2):109-144, 1977.



